Here, we discuss the Kolmogorov-Smirnov test in R, including how to interpret it, how to compare one sample to a reference distribution, and how to compare two samples.

The Kolmogorov-Smirnov test in R can be performed with the ks.test() function from the base "stats" package.

The Kolmogorov-Smirnov test is a non-parametric test that can be used to test whether a sample comes from a specified distribution, or whether two samples come from the same distribution.

Table of Some Kolmogorov-Smirnov (KS) Test Functions in R

  Function                         Usage
  ks.test(Sample, CDF Function)    Test if a sample is from a specified distribution
  ks.test(Sample 1, Sample 2)      Test if two samples are from the same distribution

1 Examples Comparing One-Sample Data to a Distribution with Interpretation

The Kolmogorov-Smirnov test statistic, \(D_n \in (0,1)\), for a hypothesized cumulative distribution function \(F(x)\) is: \[D_n= \sup_x |F_n(x)-F(x)|,\] where \(F_n\) is the empirical cumulative distribution function of the sample data with \(n\) observations. That is, \(D_n\) is the largest absolute difference between the two cumulative distribution functions over all \(x\) values.

Therefore, the higher the value of \(D_n\), the more different the distributions are, leading to a smaller p-value.
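As a minimal sketch of this definition (using assumed illustrative data and a fixed seed, mirroring Example 1 below), \(D_n\) can be computed by hand at the ordered sample points and compared with the statistic returned by ks.test():

set.seed(1)                                   # assumed seed, for reproducibility of this sketch only
x = rnorm(30, 10, 2)                          # illustrative sample of n = 30 observations
n = length(x)
Fx = pnorm(sort(x), 9, 1.5)                   # hypothesized CDF evaluated at the order statistics
# For a continuous F, the supremum is attained at a jump of the empirical CDF,
# so check the gap just before and at each order statistic:
D_manual = max(pmax(abs((1:n)/n - Fx), abs((0:(n - 1))/n - Fx)))
D_manual
unname(ks.test(x, pnorm, 9, 1.5)$statistic)   # matches D_manual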

The one-sample Kolmogorov-Smirnov test null and alternative hypotheses are:

  • \(H_0\): The sample is from the distribution \(F(x)\).

  • \(H_1\): The sample is NOT from the distribution \(F(x)\).

Example 1:

With level of significance \(\alpha = 0.05\), test if a sample of \(30\) observations from the normal distribution with \(\tt{mean = 10}\) and \(\tt{sd = 2}\) is from the normal distribution with \(\tt{mean = 9}\) and \(\tt{sd = 1.5}\).

sample = rnorm(30, 10, 2)        # draw 30 observations from N(mean = 10, sd = 2)
ks.test(sample, pnorm, 9, 1.5)   # one-sample KS test against N(mean = 9, sd = 1.5)

    Exact one-sample Kolmogorov-Smirnov test

data:  sample
D = 0.27658, p-value = 0.01608
alternative hypothesis: two-sided

The high \(D_n\) and the \(\tt{p-value}\) below the level of significance \((\alpha = 0.05)\) are due to the sample coming from a different distribution than the hypothesized one. Hence, given that the \(\tt{p-value}\;(0.01608)\) is less than \(\alpha = 0.05\), we reject \(H_0\) that the sample is from \(X \sim N(9, 1.5)\).
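Since ks.test() returns an object of class "htest", the decision can also be read off programmatically. A minimal sketch, reusing the sample above and storing the output in an object (here called result):

result = ks.test(sample, pnorm, 9, 1.5)   # store the test result
result$statistic                          # the KS statistic D
result$p.value                            # the p-value
result$p.value < 0.05                     # TRUE means reject H0 at alpha = 0.05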

See also the normal distribution tests.

Example 2:

With level of significance \(\alpha = 0.05\), test if a sample of \(50\) observations from the Student’s t-distribution with \(\tt{degrees \; of \; freedom = 12}\) is from the Student’s t-distribution with \(\tt{degrees \; of \; freedom = 12}\).

sample = rt(50, 12)       # draw 50 observations from the t-distribution with df = 12
ks.test(sample, pt, 12)   # one-sample KS test against the t-distribution with df = 12

    Exact one-sample Kolmogorov-Smirnov test

data:  sample
D = 0.087856, p-value = 0.8029
alternative hypothesis: two-sided

The low \(D_n\) and the \(\tt{p-value}\) above the level of significance \((\alpha = 0.05)\) are due to the sample coming from the hypothesized distribution. Hence, given that the \(\tt{p-value}\;(0.8029)\) is greater than \(\alpha = 0.05\), we fail to reject \(H_0\) that the sample is from \(X \sim t_{12}\).
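Note that ks.test() also accepts the cumulative distribution function by name as a character string, with distribution parameters passed as named arguments. As a minimal sketch, an equivalent call to the test above:

ks.test(sample, "pt", df = 12)   # same test as above, with the CDF passed by name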

2 Examples Comparing Two-Sample Data with Interpretation

For testing whether two samples have different underlying probability distributions, the test statistic, \(D_{n,m} \in (0,1)\), is: \[D_{n,m}=\sup_x |F_{1,n}(x)-F_{2,m}(x)|,\]

where \(F_{1,n}\) and \(F_{2,m}\) are the empirical cumulative distribution functions of Sample 1 and Sample 2, with \(n\) and \(m\) observations, respectively.

Therefore, the higher the value of \(D_{n,m}\), the more different the distributions are, leading to a smaller p-value.
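As a minimal sketch of this definition (using assumed illustrative samples and a fixed seed, mirroring Example 1 below), \(D_{n,m}\) can be computed directly from the two empirical CDFs with ecdf() and compared with ks.test():

set.seed(2)                         # assumed seed, for reproducibility of this sketch only
s1 = rexp(40, 0.2)                  # illustrative Sample 1, n = 40
s2 = rexp(35, 0.2)                  # illustrative Sample 2, m = 35
F1 = ecdf(s1); F2 = ecdf(s2)        # empirical CDFs of the two samples
pooled = sort(c(s1, s2))            # the supremum is attained at a pooled data point
D_manual = max(abs(F1(pooled) - F2(pooled)))
D_manual
unname(ks.test(s1, s2)$statistic)   # matches D_manual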

The two-sample Kolmogorov-Smirnov test null and alternative hypotheses are:

  • \(H_0\): The two samples are from the same distribution.

  • \(H_1\): The two samples are NOT from the same distribution.

Example 1:

With level of significance \(\alpha = 0.05\), test if two samples, one with \(40\) observations and another with \(35\) observations, both from the exponential distribution with \(\tt{rate = 0.2}\), are from the same distribution.

sample1 = rexp(40, 0.2)     # 40 observations from Exp(rate = 0.2)
sample2 = rexp(35, 0.2)     # 35 observations from Exp(rate = 0.2)
ks.test(sample1, sample2)   # two-sample KS test

    Exact two-sample Kolmogorov-Smirnov test

data:  sample1 and sample2
D = 0.15, p-value = 0.7319
alternative hypothesis: two-sided

The low \(D_{n,m}\) and the \(\tt{p-value}\) above the level of significance \((\alpha = 0.05)\) are due to the samples coming from the same distribution. Hence, given that the \(\tt{p-value}\;(0.7319)\) is greater than \(\alpha = 0.05\), we fail to reject \(H_0\) that the two samples are from the same distribution.
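The comparison can also be inspected visually: plotting both empirical CDFs shows the largest vertical gap between the two step functions, which is exactly \(D_{n,m}\). A minimal sketch reusing sample1 and sample2 above (the colors and title are arbitrary choices):

plot(ecdf(sample1), col = "blue", main = "Empirical CDFs of the two samples")
plot(ecdf(sample2), add = TRUE, col = "red")   # overlay the second empirical CDF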

Example 2:

With level of significance \(\alpha = 0.05\), test if a sample with \(50\) observations from the binomial distribution with \(\tt{size = 10}\) and \(\tt{prob = 0.8}\) is from the same distribution as another sample with \(50\) observations from the Poisson distribution with \(\tt{mean = 6}\).

sample1 = rbinom(50, 10, 0.8)   # 50 observations from Binomial(size = 10, prob = 0.8)
sample2 = rpois(50, 6)          # 50 observations from Poisson(mean = 6)
ks.test(sample1, sample2)       # two-sample KS test

    Exact two-sample Kolmogorov-Smirnov test

data:  sample1 and sample2
D = 0.48, p-value = 4.041e-06
alternative hypothesis: two-sided

The high \(D_{n,m}\) and the \(\tt{p-value}\) below the level of significance \((\alpha = 0.05)\) are due to the samples coming from different distributions. Hence, given that the \(\tt{p-value}\;(4.041e-06)\) is less than \(\alpha = 0.05\), we reject \(H_0\) that the two samples are from the same distribution.

3 Example of a One-Sided Test

To test whether the CDF underlying one sample lies above (or below) that of another, you can use the Kolmogorov-Smirnov test in R with the argument "alternative" set to "less" (for the CDF of Sample 1 below that of Sample 2) or "greater" (for the CDF of Sample 1 above that of Sample 2).

The one-sided Kolmogorov-Smirnov test null and alternative hypotheses are:

  • \(H_0\): The CDF of Sample 1 and the CDF of Sample 2 are identical.

  • \(H_1\): The CDF of Sample 1 lies above (or below) the CDF of Sample 2.

Example:

With level of significance \(\alpha = 0.05\), test if the CDF of a sample with \(75\) observations from the uniform distribution with \(\tt{min = 0}\) and \(\tt{max = 1}\) lies above the CDF of another sample with \(80\) observations from the uniform distribution with \(\tt{min = 0.2}\) and \(\tt{max = 1.2}\).

sample1 = runif(75, 0, 1)                            # 75 observations from Uniform(0, 1)
sample2 = runif(80, 0.2, 1.2)                        # 80 observations from Uniform(0.2, 1.2)
ks.test(sample1, sample2, alternative = "greater")   # H1: CDF of sample1 lies above that of sample2

    Exact two-sample Kolmogorov-Smirnov test

data:  sample1 and sample2
D^+ = 0.25583, p-value = 0.004928
alternative hypothesis: the CDF of x lies above that of y

The high \(D^+\) and the \(\tt{p-value}\) below the level of significance \((\alpha = 0.05)\) are due to Sample 1 coming from a distribution that typically produces smaller values than the distribution Sample 2 comes from, which causes the CDF of Sample 1 to lie above the CDF of Sample 2. Hence, given that the \(\tt{p-value}\;(0.004928)\) is less than \(\alpha = 0.05\), we reject \(H_0\) in favor of \(H_1\), that the CDF of Sample 1 lies above the CDF of Sample 2.
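The opposite direction can be tested with alternative = "less". A minimal sketch reusing the samples above (here sample2 takes the role of the first argument, so under \(H_1\) its CDF lies below that of sample1):

# H1: the CDF of sample2 lies below the CDF of sample1 (mirror of the test above)
ks.test(sample2, sample1, alternative = "less")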
