
I can't figure out how to do a two-sample KS test in SciPy.

After reading the documentation for scipy kstest, I can see how to test whether a sample follows the standard normal distribution:

from scipy.stats import kstest
import numpy as np

x = np.random.normal(0,1,1000)
test_stat = kstest(x, 'norm')
#>>> test_stat
#(0.021080234718821145, 0.76584491300591395)

This means that, with a p-value of 0.76, we cannot reject the null hypothesis that the sample was drawn from the standard normal distribution.

However, I want to compare two distributions and see if I can reject the null hypothesis that they are identical, something like:

from scipy.stats import kstest
import numpy as np

x = np.random.normal(0,1,1000)
z = np.random.normal(1.1,0.9, 1000)

and test whether x and z come from the same distribution.

I tried the naive:

test_stat = kstest(x, z)

and got the following error:

TypeError: 'numpy.ndarray' object is not callable

Is there a way to do a two-sample KS test in Python? If so, how should I do it?

Thank You in Advance

denfromufa
Akavall

2 Answers


You are using the one-sample KS test. You probably want the two-sample test ks_2samp:

>>> from scipy.stats import ks_2samp
>>> import numpy as np
>>> 
>>> np.random.seed(12345678)
>>> x = np.random.normal(0, 1, 1000)
>>> y = np.random.normal(0, 1, 1000)
>>> z = np.random.normal(1.1, 0.9, 1000)
>>> 
>>> ks_2samp(x, y)
Ks_2sampResult(statistic=0.022999999999999909, pvalue=0.95189016804849647)
>>> ks_2samp(x, z)
Ks_2sampResult(statistic=0.41800000000000004, pvalue=3.7081494119242173e-77)

The results can be interpreted as follows:

  1. You can compare the statistic returned by ks_2samp to the critical value from a KS-test table for your sample sizes. When the statistic is higher than the critical value, you reject the null hypothesis that the two samples come from the same distribution.

  2. Or you can compare the p-value to a significance level α, usually α=0.05 or 0.01 (you decide; the lower α is, the stronger the evidence required to reject). If the p-value is lower than α, you reject the null hypothesis that the two samples come from the same distribution. Note that the p-value is not the probability that the distributions differ.
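Both decision rules can be sketched in a few lines. This is only an illustration, not part of the original answer: the helper `ks_critical_value` is a hypothetical name, and it uses the standard large-sample approximation for the two-sample critical value, c(alpha) * sqrt((n + m) / (n * m)) with c(alpha) = sqrt(-ln(alpha / 2) / 2), rather than an exact table lookup. The significance level is called `alpha` here.

```python
import numpy as np
from scipy.stats import ks_2samp


def ks_critical_value(n, m, alpha=0.05):
    """Hypothetical helper: large-sample approximation to the
    two-sample KS critical value for sample sizes n and m."""
    c_alpha = np.sqrt(-0.5 * np.log(alpha / 2.0))
    return c_alpha * np.sqrt((n + m) / (n * m))


rng = np.random.default_rng(0)
x = rng.normal(0, 1, 1000)
z = rng.normal(1.1, 0.9, 1000)

alpha = 0.05
result = ks_2samp(x, z)
d_crit = ks_critical_value(len(x), len(z), alpha)

# Rule 1: compare the KS statistic to the (approximate) critical value.
reject_by_statistic = result.statistic > d_crit
# Rule 2: compare the p-value to the significance level.
reject_by_pvalue = result.pvalue < alpha

print(f"statistic={result.statistic:.3f}, critical value={d_crit:.3f}")
print(f"p-value={result.pvalue:.3g}, alpha={alpha}")
print("reject H0 (rule 1):", reject_by_statistic)
print("reject H0 (rule 2):", reject_by_pvalue)
```

For samples of this size the two rules should normally agree; for small samples an exact critical-value table is more appropriate than the approximation used here.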

Toby Speight
DSM
  • 1
    That's exactly what I was looking for. Thank You Very Much! – Akavall Jun 04 '12 at 16:35
  • 2
    How do you interpret these results? Can you say the samples come from the same distribution just by looking at `statistic` and `p-value`? – FaCoffee Feb 24 '17 at 10:40
  • 4
    @FaCoffee This is what the scipy docs say: "_If the K-S statistic is small or the p-value is high, then we cannot reject the hypothesis that the distributions of the two samples are the same._" – user2738815 Mar 18 '17 at 08:29

This is what the scipy docs say:

If the K-S statistic is small or the p-value is high, then we cannot reject the hypothesis that the distributions of the two samples are the same.

"Cannot reject" does not mean "confirm": failing to reject the null hypothesis does not prove that the two distributions are the same.
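A rough way to see this is to draw two samples from distributions that really are different and vary the sample size. The sketch below is illustrative only; the shift of 0.2 and the sample sizes are arbitrary choices, not values from the SciPy docs. With few observations the test will often fail to reject even though the distributions differ, while with many observations the same difference is detected.

```python
import numpy as np
from scipy.stats import ks_2samp

rng = np.random.default_rng(42)
pvalues = {}

for n in (20, 20000):
    # The two populations differ: the means are 0.0 and 0.2.
    a = rng.normal(0.0, 1.0, n)
    b = rng.normal(0.2, 1.0, n)
    res = ks_2samp(a, b)
    pvalues[n] = res.pvalue
    print(f"n={n}: statistic={res.statistic:.4f}, p-value={res.pvalue:.3g}")
```

A high p-value at n=20 would only mean the test lacked the power to detect the difference, not that the distributions are identical.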

piet.t
jun 小嘴兔
  • could you explain your answer in further detail? thanks in advance! – King Reload May 02 '17 at 08:19
  • @KingReload It means when the *p* value is very small, that says the probability of these two samples *Not* coming from the same distribution is very low. In another word, the probability of these two sample coming from same distribution is very high. But you can not be 100% sure about that hence *p* values are never zero. (Sometimes they show as 0, but actually, it's never zero). That's why it is said that *We failed to reject the null hypothesis* instead of *We are accepting the null hypothesis*. Accepting null hypothesis = *distributions of the two samples are the same* – MD Abid Hasan Feb 14 '18 at 22:26
  • 3
    A high p-value means the samples are consistent with coming from the same distribution; a small p-value suggests they do not. @MDAbidHasan has it backwards. Indeed, the documentation gives an example: ```For an identical distribution, we cannot reject the null hypothesis since the p-value is high, 41%: >>> rvs4 = stats.norm.rvs(size=n2, loc=0.0, scale=1.0) >>> stats.ks_2samp(rvs1, rvs4) (0.07999999999999996, 0.41126949729859719)``` – superhero Feb 23 '18 at 17:35