
Building on an earlier question: Computing stats on generators in single pass. Python

As I mentioned before, computing statistics from a generator in a single pass is fast and memory efficient. Rank-based statistics such as the 90th percentile or the nth smallest element usually need more work than the mean and standard deviation (solved in the question above). Single-pass approaches matter most in map/reduce jobs and on large datasets, where putting the data into a list or making multiple passes becomes very slow.

The following is a quicksort-style selection algorithm, O(n) on average, for looking up data by rank order. It is useful for finding medians, percentiles, quartiles, and deciles, and is equivalent to data[n] when the data is already sorted. However, it needs all the data in a list that can be split/pivoted.

How can you compute medians, percentiles, quartiles, and deciles with a generator on a single pass?

The Quicksort style algorithm that needs a complete list

import random

def select(data, n):
    "Find the nth rank ordered element (the least value has rank 0)."
    data = list(data)
    if not 0 <= n < len(data):
        raise ValueError('not enough elements for the given rank')
    while True:
        pivot = random.choice(data)
        pcount = 0
        under, over = [], []
        uappend, oappend = under.append, over.append
        for elem in data:
            if elem < pivot:
                uappend(elem)
            elif elem > pivot:
                oappend(elem)
            else:
                pcount += 1
        if n < len(under):
            data = under
        elif n < len(under) + pcount:
            return pivot
        else:
            data = over
            n -= len(under) + pcount
Matt Alcock
  • What do you mean by "with a generator"? You mean an online quantile selection algorithm? What are your memory constraints? P.S. the "Quicksort style" algorithm is known as QuickSelect, because it selects the kth element in a QuickSort style. – Has QUIT--Anony-Mousse Jul 04 '12 at 16:03
  • A generator is python term for collection you can pass through once to collect the data. Yes I mean an online quantile selection algorithm. Thanks re QuickSelect. – Matt Alcock Jul 04 '12 at 16:07
  • You didn't answer the memory constraints question yet. This is essential, because the element you are looking for could have been the first one, so you potentially need to memorize the complete stream (unless you know a bound on the stream size, that is) – Has QUIT--Anony-Mousse Jul 04 '12 at 16:10
  • Do you want to compute *exact* answers or are you happy with approximations? – Chris Taylor Jul 04 '12 at 16:36
  • related: http://stackoverflow.com/q/1058813/4279 – jfs Jul 04 '12 at 17:01
  • You mention map/reduce in your question, but you tagged the question with python. I am going out on a limb here, but perhaps you are using python to write custom mappers and reducers? – Samsdram Jul 10 '12 at 03:11

1 Answer


You will need to store large parts of the data, up to the point where it may simply pay off to store it completely, unless you are willing to accept an approximate algorithm (which may be very reasonable when you know your data is independent).
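As one illustration of the approximate route, here is a minimal sketch based on reservoir sampling. It is a swapped-in technique, not something from the question: it keeps a uniform random sample of at most k items and reads the quantile off the sorted sample. The names approx_quantile, k and rng are made up for this example, and the estimate is only trustworthy under the assumption that a random sample represents the stream well.

```python
import random

def approx_quantile(stream, q, k=1000, rng=None):
    """Estimate the q-quantile (0 <= q <= 1) of a one-pass stream.

    Hypothetical helper: keeps a uniform sample of at most k items
    (reservoir sampling), so memory is O(k) regardless of stream length.
    """
    if not 0 <= q <= 1:
        raise ValueError('q must be between 0 and 1')
    if rng is None:
        rng = random.Random()
    reservoir = []
    for i, item in enumerate(stream):
        if i < k:
            # Fill the reservoir with the first k items.
            reservoir.append(item)
        else:
            # Keep the new item with probability k / (i + 1).
            j = rng.randint(0, i)  # inclusive on both ends
            if j < k:
                reservoir[j] = item
    if not reservoir:
        raise ValueError('empty stream')
    reservoir.sort()
    # Read the quantile off the sorted sample (exact if the stream had <= k items).
    return reservoir[int(q * (len(reservoir) - 1))]
```

When the stream has no more than k elements the result is exact; beyond that the error shrinks roughly with the square root of k, so k is the knob that trades memory for accuracy.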

Consider you need to find the median of the following data set:

0  1  2  3  4  5  6  7  8  9 -1 -2 -3 -4 -5 -6 -7 -8 -9

The median is obviously 0. However, if you have seen only the first 10 elements, 0 is your worst possible guess at that time! So in order to find the median of an n-element stream, you need to keep at least n/2 candidate elements in memory. And if you do not know the total size n, you need to keep them all!

Here are the medians for every odd-sized situation:

0  _  1  _  2  _  3  _  4  _  4  _  3  _  2  _  1  _  0
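The running medians listed above can be reproduced in a single pass with the usual two-heap technique, sketched below. Note that it is exact only because it keeps every element seen so far, which is the point of the argument. The name running_medians is made up for this example, and the negation trick assumes numeric data.

```python
import heapq

def running_medians(stream):
    """Yield the median of all elements seen so far, after each element.

    Hypothetical helper. Memory is O(n): every element stays in a heap.
    """
    low = []   # max-heap (stored negated) holding the lower half
    high = []  # min-heap holding the upper half
    for x in stream:
        if low and x > -low[0]:
            heapq.heappush(high, x)
        else:
            heapq.heappush(low, -x)
        # Rebalance so that len(low) is len(high) or len(high) + 1.
        if len(low) > len(high) + 1:
            heapq.heappush(high, -heapq.heappop(low))
        elif len(high) > len(low):
            heapq.heappush(low, -heapq.heappop(high))
        if len(low) > len(high):
            yield -low[0]                   # odd count: middle element
        else:
            yield (-low[0] + high[0]) / 2   # even count: mean of the two middle elements

data = list(range(10)) + [-i for i in range(1, 10)]
odd_sized = list(running_medians(iter(data)))[::2]
print(odd_sized)  # [0, 1, 2, 3, 4, 4, 3, 2, 1, 0]
```

Every second value of the output corresponds to an odd-sized prefix and matches the sequence above.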

While they were never candidates, you also need to remember the elements 5 through 9:

0  1  2  3  4  5  6  7  8  9 10 11 12 13 14 15 16 17 18

yields the median 9. For every element in a series of size n, I can find a continued series of size O(2*n) that has this element as its median. But obviously, these series are not random / independent.
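When the stream length is known in advance, so that the median's rank is known too, the n/2 bound from the argument above is roughly achievable in one pass. A minimal sketch using the standard library's heapq.nsmallest follows; the name select_rank is made up for this example, and the memory claim assumes nsmallest only keeps its running candidates, as it does in CPython for plain iterators.

```python
import heapq

def select_rank(stream, rank):
    """Return the element of the given rank (least value has rank 0).

    Hypothetical helper: consumes the stream once and holds on to roughly
    rank + 1 candidates rather than the whole stream.
    """
    if rank < 0:
        raise ValueError('rank must be non-negative')
    smallest = heapq.nsmallest(rank + 1, stream)  # sorted ascending
    if len(smallest) <= rank:
        raise ValueError('not enough elements for the given rank')
    return smallest[-1]

# Median of the 19-element example: rank 19 // 2 == 9.
data = list(range(10)) + [-i for i in range(1, 10)]
print(select_rank(iter(data), 9))  # 0
```

For a high percentile such as the 90th it would be cheaper to track the largest elements instead (heapq.nlargest), but neither variant helps when the total size is unknown.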

See the question '"On-line" (iterator) algorithms for estimating statistical median, mode, skewness, kurtosis?' for an overview of related methods.

Has QUIT--Anony-Mousse