Problem:
Here I plot two datasets (the text files in the list dataset), each containing 21.8 billion data points. That is too large to hold in memory as an array, so I build each histogram incrementally. Plotting works, but I'm unsure how to compare the two via a two-sample KS test, because I can't figure out how to access each histogram inside the plt object.
Example:
Here is some code to generate dummy data:
import numpy as np

mu = [100, 120]
sigma = 30
dataset = ['gsl_test_1.txt', 'gsl_test_2.txt']
for idx, file in enumerate(dataset):
    dist = np.random.normal(mu[idx], sigma, 10000)
    with open(file, 'w') as g:
        for s in dist:
            # three tab-separated fields; the sample goes in the third column
            g.write('{}\t{}\t{}\n'.format('stuff', 'stuff', s))
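For reference, each line of the generated files holds three tab-separated fields with the sample in the third column, so the values can be streamed back in chunks without ever holding a full file in memory. A quick sanity-check sketch (the small sizes here are just for illustration):

```python
import numpy as np
import pandas as pd

# regenerate a small dummy file in the same format as above
np.random.seed(0)
dist = np.random.normal(100, 30, 100)
with open('gsl_test_1.txt', 'w') as g:
    for s in dist:
        g.write('{}\t{}\t{}\n'.format('stuff', 'stuff', s))

# stream it back in chunks; only the third column holds the samples
parts = []
for chunk in pd.read_table('gsl_test_1.txt', header=None,
                           chunksize=10, delimiter='\t'):
    parts.append(chunk.iloc[:, 2].to_numpy())
vals = np.concatenate(parts)
assert len(vals) == 100
```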
This generates my two histograms (made possible here):
import math
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt

chunksize = 1000
dataset = ['gsl_test_1.txt', 'gsl_test_2.txt']
for fh in dataset:
    # first pass: find the min, max, and row count, to size the bins
    low = np.inf
    high = -np.inf
    lines = 0
    for chunk in pd.read_table(fh, header=None, chunksize=chunksize, delimiter='\t'):
        low = np.minimum(chunk.iloc[:, 2].min(), low)
        high = np.maximum(chunk.iloc[:, 2].max(), high)
        lines += len(chunk)  # count actual rows; loop*chunksize overshoots on a partial last chunk
    nbins = math.ceil(math.sqrt(lines))
    bin_edges = np.linspace(low, high, nbins + 1)
    total = np.zeros(nbins, np.int64)  # int64 so very large counts don't overflow

    # second pass: accumulate bin counts over chunks
    for chunk in pd.read_table(fh, header=None, chunksize=chunksize, delimiter='\t'):
        # compute bin counts over the 3rd column
        subtotal, _ = np.histogram(chunk.iloc[:, 2], bins=bin_edges)
        total += subtotal
    plt.hist(bin_edges[:-1], bins=bin_edges, weights=total)
plt.savefig('gsl_test_hist.svg')
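One way around "accessing the histogram in the plt object" is to not go through plt at all: keep the accumulated counts and edges per file in a plain dict, then both plot and test from those arrays. A minimal sketch (hist_data and accumulate are my names, not part of the code above):

```python
import numpy as np

# counts/edges per file, filled in during the chunked pass
hist_data = {}

def accumulate(fname, values, bin_edges, store=hist_data):
    """Add one chunk's bin counts to the running total for fname."""
    subtotal, _ = np.histogram(values, bins=bin_edges)
    if fname not in store:
        store[fname] = {'edges': bin_edges,
                        'counts': np.zeros(len(bin_edges) - 1, np.int64)}
    store[fname]['counts'] += subtotal

# example: two chunks of fake data binned into the same edges
edges = np.linspace(0, 10, 11)
accumulate('gsl_test_1.txt', np.array([1.0, 2.5, 7.2]), edges)
accumulate('gsl_test_1.txt', np.array([2.7, 9.9]), edges)
```

After the loops finish, hist_data['gsl_test_1.txt']['counts'] is an ordinary array you can hand to plt.hist (via weights=) or to any downstream statistic, with no need to dig anything back out of matplotlib.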
Question:
Most examples of the KS statistic take two arrays of raw observations, but I don't have enough memory for that approach. Given the code above, how can I use the precomputed bin counts for 'gsl_test_1.txt' and 'gsl_test_2.txt' to compute the KS statistic between the two distributions?
Bonus karma: Record the KS statistic and p-value on the graph!
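One possible approach, sketched below: if both files are binned over the same bin_edges (i.e. compute a common low/high/nbins across both files first, rather than per file), the KS statistic is just the maximum gap between the two binned empirical CDFs, and an asymptotic p-value comes from the Kolmogorov distribution. ks_from_binned is my name; the correction factor in the p-value is the standard large-sample approximation (as in Numerical Recipes), and it is only as good as the binning is fine:

```python
import numpy as np
from scipy.special import kolmogorov  # survival function of the Kolmogorov distribution

def ks_from_binned(counts1, counts2):
    """Two-sample KS from bin counts that share the same bin edges.

    D is the max |CDF1 - CDF2| evaluated at the bin boundaries; the
    p-value uses the usual large-n asymptotic approximation.
    """
    n1, n2 = counts1.sum(), counts2.sum()
    cdf1 = np.cumsum(counts1) / n1
    cdf2 = np.cumsum(counts2) / n2
    d = np.max(np.abs(cdf1 - cdf2))
    en = np.sqrt(n1 * n2 / (n1 + n2))
    p = kolmogorov((en + 0.12 + 0.11 / en) * d)
    return d, p

# identical histograms -> D == 0
c = np.array([5, 10, 20, 10, 5])
d, p = ks_from_binned(c, c)
```

For the bonus, something like plt.text(0.95, 0.95, 'KS D={:.3f}, p={:.3g}'.format(d, p), transform=plt.gca().transAxes, ha='right', va='top') would put the numbers on the figure before savefig.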