
Problem:

Here I plot two datasets stored in text files (in the list `dataset`), each containing 21.8 billion data points. That is far too much data to hold in memory as an array. I can still graph them as histograms, but I'm unsure how to quantify their difference with a two-sample KS test, because I can't figure out how to access each histogram in the `plt` object.

Example:

Here is some code to generate dummy data:

import numpy as np

mu = [100, 120]
sigma = 30
dataset = ['gsl_test_1.txt', 'gsl_test_2.txt']
for idx, file in enumerate(dataset):
    dist = np.random.normal(mu[idx], sigma, 10000)
    with open(file, 'w') as g:
        for s in dist:
            g.write('{}\t{}\t{}\n'.format('stuff', 'stuff', str(s)))

This generates my two histograms (made possible here):

import math
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt

chunksize = 1000
dataset = ['gsl_test_1.txt', 'gsl_test_2.txt']
for fh in dataset:
    # find the min, max, line qty, for bins
    low = np.inf
    high = -np.inf

    loop = 0
    for chunk in pd.read_table(fh, header=None, chunksize=chunksize, delimiter='\t'):
        low = np.minimum(chunk.iloc[:, 2].min(), low)
        high = np.maximum(chunk.iloc[:, 2].max(), high)
        loop += 1
    lines = loop*chunksize

    nbins = math.ceil(math.sqrt(lines))   

    bin_edges = np.linspace(low, high, nbins + 1)
    total = np.zeros(nbins, np.int64)  # bin-count accumulator (int64 to avoid overflow)

    for chunk in pd.read_table(fh, header=None, chunksize=chunksize, delimiter='\t'):

        # compute bin counts over the 3rd column
        subtotal, e = np.histogram(chunk.iloc[:, 2], bins=bin_edges)  # np.ndarray filled with np.int64

        # accumulate bin counts over chunks
        total += subtotal


    plt.hist(bin_edges[:-1], bins=bin_edges, weights=total)
    plt.savefig('gsl_test_hist.svg')

Question:

Most examples of the KS statistic use two arrays of raw data/observations/points, but I don't have enough memory for that approach. Given the example above, how can I use these precomputed bins (from 'gsl_test_1.txt' and 'gsl_test_2.txt') to compute the KS statistic between the two distributions?

Bonus karma: Record the KS statistic and p-value on the graph!

– Thomas Matthew

1 Answer


I cleaned up your code a bit. Writing to StringIO is more streamlined than writing to a file, and I set the default style with seaborn instead of plain matplotlib so it looks more modern. The bin thresholds should be the same for both samples if you want the statistical test to line up.

Iterating through the files to build the bins this way may take much longer than it needs to. `Counter` could be useful, because you would only have to loop through each file once, and you would get the same bins for both samples: convert the floats to ints, since you are binning them together anyway, then `from collections import Counter`, `C = Counter()`, and `C[value] += 1`. You end up with a dict from which you can make the bins via `list(C.keys())`. This would help since your data is so large (see the sketch just below). Also, check whether there is a way to read in chunks with numpy instead of pandas, because numpy is much faster at indexing: try a `%timeit` of `DF.iloc[i,j]` versus `ARRAY[i,j]` and you'll see what I mean. Finally, I wrapped much of the code in a function to make it more modular.
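For what it's worth, here is a minimal sketch of that `Counter` idea; the helper name `count_rounded_values` and the rounding step are my own illustration, not part of the cleaned-up code further down:

from collections import Counter
import numpy as np
import pandas as pd

def count_rounded_values(fh, chunksize=1000):
    # One pass over the file: count rounded values from the 3rd column.
    C = Counter()
    fh.seek(0)
    for chunk in pd.read_table(fh, header=None, chunksize=chunksize, sep='\t'):
        values = chunk.iloc[:, 2].to_numpy()   # a plain numpy array is faster to index than .iloc
        C.update(np.rint(values).astype(int))  # "bin" by rounding floats to ints
    return C

# The union of keys gives identical bins for both samples, e.g.:
# counts_1 = count_rounded_values(gs1_test_1)
# counts_2 = count_rounded_values(gs1_test_2)
# shared_bins = sorted(set(counts_1) | set(counts_2))

And here is the cleaned-up version of your original approach: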

import numpy as np
import pandas as pd
import math
import matplotlib.pyplot as plt
from io import StringIO
from scipy.stats import ks_2samp
import seaborn as sns; sns.set()

%matplotlib inline

#Added seaborn b/c it looks mo betta

mu = [100, 120]
sigma = 30

def write_random(file,mu,sigma=30):
    dist = np.random.normal(mu, sigma, 10000)
    for i,s in enumerate(dist):
        file.write('{}\t{}\t{}\n'.format("label_A-%d" % i, "label_B-%d" % i, str(s)))
    return(file)

#Writing to StringIO instead of an actual file
gs1_test_1 = write_random(StringIO(),mu=100)
gs1_test_2 = write_random(StringIO(),mu=120)

chunksize = 1000

def make_hist(fh,ax):
    # find the min, max, line qty, for bins
    low = np.inf
    high = -np.inf

    loop = 0

    fh.seek(0)
    for chunk in pd.read_table(fh, header=None, chunksize=chunksize, sep='\t'):
        low = np.minimum(chunk.iloc[:, 2].min(), low) #btw, iloc is way slower than numpy array indexing
        high = np.maximum(chunk.iloc[:, 2].max(), high) #you might wanna import and do the chunks with numpy
        loop += 1
    lines = loop*chunksize

    nbins = math.ceil(math.sqrt(lines))   

    bin_edges = np.linspace(low, high, nbins + 1)
    total = np.zeros(nbins, np.int64)  # bin-count accumulator (int64 to avoid overflow)

    fh.seek(0)
    for chunk in pd.read_table(fh, header=None, chunksize=chunksize, delimiter='\t'):

        # compute bin counts over the 3rd column
        subtotal, e = np.histogram(chunk.iloc[:, 2], bins=bin_edges)  # np.ndarray filled with np.int64

        # accumulate bin counts over chunks
        total += subtotal

    ax.hist(bin_edges[:-1], bins=bin_edges, weights=total, alpha=0.5)

    return(ax,bin_edges,total)

#Make the plot canvas to write on to give it to the function
fig,ax = plt.subplots()

test_1_data = make_hist(gs1_test_1,ax)
test_2_data = make_hist(gs1_test_2,ax)

#test_1_data[1] == test_2_data[1] -- the bins should be the same if you're going to try and compare them
ax.set_title("KS: %f, p-value: %f" % ks_2samp(test_1_data[2], test_2_data[2]))

(Plot: the two overlaid histograms, with the KS statistic and p-value in the title.)

– O.rka
I believe the reported KS statistic here is wrong. `ks_2samp` works on raw samples, and you are passing it the precomputed histograms, so the test actually operates on the frequencies in each bin. The OP's question asks specifically how to run K-S on precomputed histograms. – nroulet Oct 27 '18 at 02:54
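For reference, here is a minimal sketch of how the KS statistic could be computed directly from two binned histograms that share the same bin edges, which is what the question asks for. It approximates the two empirical CDFs from cumulative bin counts; `counts_1`/`counts_2` stand for the `total` arrays returned by `make_hist` above, and the helper name is hypothetical:

import numpy as np

def ks_from_binned_counts(counts_1, counts_2):
    # Approximate two-sample KS statistic from histograms on identical bin edges.
    cdf_1 = np.cumsum(counts_1) / np.sum(counts_1)  # empirical CDF of sample 1 at the bin edges
    cdf_2 = np.cumsum(counts_2) / np.sum(counts_2)  # empirical CDF of sample 2 at the bin edges
    return np.max(np.abs(cdf_1 - cdf_2))            # KS statistic: largest CDF difference

# e.g. with the tuples returned by make_hist:
# D = ks_from_binned_counts(test_1_data[2], test_2_data[2])

An asymptotic p-value could then be approximated from the limiting Kolmogorov distribution (`scipy.stats.kstwobign`), using the total counts of the two histograms as the sample sizes.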