How to calculate the largest distance between two cumulative sample distributions in Python?

Question

Assume there are two 1D NumPy arrays of samples, X1 and X2, with the same length. After converting each sample separately into an empirical cumulative distribution, how do I calculate the largest distance between the two cumulative distributions? What should I do after the code below?

import numpy as np
def function(X1, X2):
    x1 = np.sort(X1)
    y1 = np.arange(1, len(x1)+1) / float(len(x1))
    x2 = np.sort(X2)
    y2 = np.arange(1, len(x2)+1) / float(len(x2))

score 0 · Answer 1 · answered Nov 15 '20 at 17:16

From your kolmogorov-smirnov tag I gather that the function you are looking for is in SciPy; see https://docs.scipy.org/doc/scipy/reference/generated/scipy.stats.kstest.html

One of its input modes takes two sample arrays, which is even easier than what you started to implement. Just call it directly, as in these examples:

from scipy.stats import kstest
import numpy as np

# No seed is set, so the numbers shown in the comments will differ between runs.
samps1 = np.random.normal(size=100)
samps2 = np.random.normal(size=100)
samps3 = np.random.normal(loc=1, size=100)  # shifted mean

kstest(samps1, samps2)
# KstestResult(statistic=0.15, pvalue=0.21117008625127576)
kstest(samps2, samps1)  # argument order does not change the result
# KstestResult(statistic=0.15, pvalue=0.21117008625127576)
kstest(samps1, samps3)
# KstestResult(statistic=0.29, pvalue=0.0004117410017938115)
kstest(samps2, samps1).statistic  # the largest distance only
# 0.15

Note that the function returns both the statistic and the p-value, so to get only the largest distance, read the .statistic attribute of the result.
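
If you would rather finish the NumPy approach from the question, here is a minimal sketch. It assumes "largest distance" means the largest vertical gap between the two empirical CDFs (the two-sample Kolmogorov-Smirnov statistic). The function name max_ecdf_distance and the pooled evaluation grid are my own choices for illustration, not something from the question or from SciPy. The result should agree with the .statistic value returned by kstest on the same two arrays.

```python
import numpy as np

def max_ecdf_distance(X1, X2):
    """Largest vertical gap between the empirical CDFs of X1 and X2."""
    x1 = np.sort(X1)
    x2 = np.sort(X2)
    # Both ECDFs are step functions that only change at sample points,
    # so comparing them on the pooled sample is sufficient.
    grid = np.concatenate([x1, x2])
    # searchsorted with side="right" counts the values <= each grid point;
    # dividing by the sample size gives the ECDF value there.
    cdf1 = np.searchsorted(x1, grid, side="right") / len(x1)
    cdf2 = np.searchsorted(x2, grid, side="right") / len(x2)
    return np.max(np.abs(cdf1 - cdf2))

# Example usage with seeded random samples (the seed is arbitrary).
rng = np.random.default_rng(0)
a = rng.normal(size=100)
b = rng.normal(loc=1, size=100)
print(max_ecdf_distance(a, b))
```

Evaluating both ECDFs on the same grid matters: the y1 and y2 arrays in the question are defined at different x positions, so subtracting them element by element would not compare the two curves at the same point.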
