Building on an earlier question: Computing stats on generators in single pass. Python
As I mentioned before, computing statistics from a generator in a single pass is fast and memory efficient. Rank-based statistics such as the 90th percentile and the nth smallest element often need more work than standard deviation and averages (solved in the question above). These approaches matter most in map/reduce jobs and on large datasets, where loading the data into a list or making multiple passes becomes very slow.
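For context, here is a minimal sketch of what a single-pass mean and standard deviation over a generator can look like. It uses Welford's running update, which is one common technique and not necessarily the approach from the earlier question; the name `running_mean_std` is hypothetical.

```python
import math

def running_mean_std(values):
    """Return (count, mean, sample standard deviation) in one pass.

    Illustrative sketch using Welford's running update; only three
    numbers are kept in memory, so `values` can be any iterable,
    including a generator that can be consumed only once.
    """
    count = 0
    mean = 0.0
    m2 = 0.0  # running sum of squared deviations from the mean
    for x in values:
        count += 1
        delta = x - mean
        mean += delta / count
        m2 += delta * (x - mean)
    if count < 2:
        raise ValueError('need at least two values')
    return count, mean, math.sqrt(m2 / (count - 1))

# The generator expression is consumed exactly once.
count, mean, std = running_mean_std(x for x in [2, 4, 4, 4, 5, 5, 7, 9])
```

Nothing comparable works directly for rank statistics, because the answer depends on the ordering of all values rather than on a few running totals.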
The following is a quicksort-style selection algorithm (expected O(n) time) for looking up data by rank order. It is useful for finding medians, percentiles, quartiles, and deciles, and is equivalent to data[n] when the data is already sorted. However, it needs all the data in a list that can be split/pivoted.
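To make the link between percentiles and rank concrete, here is a small sketch that maps a percentile to a 0-based rank. It assumes the nearest-rank convention (other percentile definitions interpolate between neighbours), and the helper name `nearest_rank` is hypothetical. The expression `sorted(data)[n]` stands in for the rank lookup.

```python
import math

def nearest_rank(count, pct):
    """Return the 0-based rank of the pct-th percentile among count items.

    Assumes the nearest-rank definition: the smallest value such that
    at least pct percent of the data is less than or equal to it.
    """
    if count <= 0:
        raise ValueError('no data')
    if not 0 < pct <= 100:
        raise ValueError('percentile must be in (0, 100]')
    return max(0, math.ceil(pct * count / 100) - 1)

data = [50, 15, 40, 20, 35]
ordered = sorted(data)  # reference lookup: the sorted list indexed by rank

quartile_1 = ordered[nearest_rank(len(data), 25)]  # 20
median = ordered[nearest_rank(len(data), 50)]      # 35
pct_90 = ordered[nearest_rank(len(data), 90)]      # 50
```

Sorting costs O(n log n) and, like the selection algorithm, requires the whole dataset in memory, which is the limitation this question is about.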
How can you compute medians, percentiles, quartiles, and deciles from a generator in a single pass?
The quicksort-style algorithm that needs a complete list:
import random

def select(data, n):
    "Find the nth rank ordered element (the least value has rank 0)."
    data = list(data)
    if not 0 <= n < len(data):
        raise ValueError('not enough elements for the given rank')
    while True:
        pivot = random.choice(data)
        pcount = 0
        under, over = [], []
        uappend, oappend = under.append, over.append
        for elem in data:
            if elem < pivot:
                uappend(elem)
            elif elem > pivot:
                oappend(elem)
            else:
                pcount += 1
        if n < len(under):
            data = under
        elif n < len(under) + pcount:
            return pivot
        else:
            data = over
            n -= len(under) + pcount