I've read through several variants of this question and their answers, but I haven't been able to work out how to address my particular problem. I believe an answer would be useful to others as well.
I'm trying to work out a conceptual approach to calculating the median of a single field's values using Python mappers and reducers within the Hadoop streaming framework.
Say we have a CSV with 20 fields and four million rows. How would we calculate the median of one field, let's call it `number`, which holds values such as 307, 212, 719, 2123, 77, 398, etc.?
I know a few ways to do this in pure Python and Pandas, but they don't translate to the Hadoop streaming model. Thank you.
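For context, here is a minimal sketch of the naive approach I've been considering: a mapper that emits every value of the target field under one constant key, so that a single reducer receives all values and can sort them to find the median. The `NUMBER_FIELD` index and the tab-delimited key/value format are assumptions on my part, and I realize funneling four million values through one reducer may not scale, which is part of why I'm asking for a better conceptual approach.

```python
import csv

NUMBER_FIELD = 4  # hypothetical 0-based index of the "number" column


def mapper(lines):
    """Emit each value of the target field under a single constant key,
    forcing all values to one reducer (only viable for modest data sizes)."""
    for row in csv.reader(lines):
        try:
            value = float(row[NUMBER_FIELD])
        except (IndexError, ValueError):
            continue  # skip header or malformed rows
        yield f"median\t{value}"


def reducer(lines):
    """Collect every value for the single key and compute the median."""
    values = sorted(float(line.split("\t", 1)[1]) for line in lines)
    n = len(values)
    mid = n // 2
    if n % 2:
        return values[mid]
    return (values[mid - 1] + values[mid]) / 2.0
```

In an actual streaming job each function would read from `sys.stdin` (e.g. `for out in mapper(sys.stdin): print(out)`), with the two scripts passed via `-mapper` and `-reducer`.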