
I've read through several variants of this question and their answers, but I haven't been able to work out how to address my particular problem. I believe an answer would be useful to others as well.

I'm trying to define a conceptual approach to calculating the median of a series of numbers in a single field, using Python mappers and reducers within the Hadoop streaming framework.

Say we have a csv with 20 fields and four million rows. How would we calculate the median of one field, let's call it `number`, that holds a numeric value (e.g. 307, 212, 719, 2123, 77, 398, etc.)?

I know a few ways to do this using pure Python and Pandas, but they don't translate to the Hadoop streaming framework. Thank you.
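One possible shape for this is a minimal sketch, not a definitive solution: have the mapper emit every `number` value under a single constant key so that one reducer receives them all, and have that reducer sort and pick the middle value. The column index `NUMBER_FIELD` and the key `"med"` below are my own assumptions, and note that this loads all values into the reducer's memory, which is workable for four million integers but does not scale indefinitely.

```python
import csv
import sys

NUMBER_FIELD = 4  # hypothetical: zero-based index of the "number" column


def mapper(lines):
    """Emit each value under a single key so one reducer sees them all."""
    for row in csv.reader(lines):
        try:
            yield "med\t%d" % int(row[NUMBER_FIELD])
        except (ValueError, IndexError):
            continue  # skip header or malformed rows


def reducer(lines):
    """Collect all values for the single key and return the median."""
    values = sorted(int(line.split("\t", 1)[1]) for line in lines)
    n = len(values)
    mid = n // 2
    if n % 2:
        return float(values[mid])
    return (values[mid - 1] + values[mid]) / 2.0


if __name__ == "__main__":
    # When run under Hadoop streaming, each script reads stdin and
    # writes stdout; here both phases share one file for illustration.
    if sys.argv[1:] == ["map"]:
        for out in mapper(sys.stdin):
            print(out)
    else:
        print(reducer(sys.stdin))
```

You would ship this as two `-mapper`/`-reducer` commands (e.g. `python script.py map` and `python script.py`), letting Hadoop's shuffle deliver all `med`-keyed lines to a single reducer.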

  • You may want to see [the algorithm for finding the median in a stream of numbers](http://stackoverflow.com/questions/10657503/find-running-median-from-a-stream-of-integers) to help you. The key idea is to use a heap to sort all the numbers you've seen so far and update the running median as more numbers are read in. – Akshat Mahajan Mar 27 '16 at 18:43
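The heap idea from the linked answer can be sketched as follows: keep a max-heap for the lower half of the numbers and a min-heap for the upper half, rebalancing after each insert so the sizes differ by at most one. This is a generic illustration of that technique (the class name and method names are my own), not code from the linked answer:

```python
import heapq


class RunningMedian:
    """Running median via two heaps: a max-heap for the lower half
    (stored as negated values, since heapq is a min-heap) and a
    min-heap for the upper half."""

    def __init__(self):
        self.lo = []  # max-heap of lower half (values negated)
        self.hi = []  # min-heap of upper half

    def add(self, x):
        # Push onto the side the value belongs to.
        if self.lo and x > -self.lo[0]:
            heapq.heappush(self.hi, x)
        else:
            heapq.heappush(self.lo, -x)
        # Rebalance so len(lo) == len(hi) or len(lo) == len(hi) + 1.
        if len(self.lo) > len(self.hi) + 1:
            heapq.heappush(self.hi, -heapq.heappop(self.lo))
        elif len(self.hi) > len(self.lo):
            heapq.heappush(self.lo, -heapq.heappop(self.hi))

    def median(self):
        if len(self.lo) > len(self.hi):
            return float(-self.lo[0])
        return (-self.lo[0] + self.hi[0]) / 2.0
```

Each insert costs O(log n), so a reducer could stream values through this without sorting the full list at the end, at the cost of still holding all values in the two heaps.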
