
I am writing simple MapReduce programs to find the average, the smallest number, and the largest number in my data (many text files). I guess that using a combiner to first find the desired values within the numbers processed by a single mapper would make it more efficient.

However, I am concerned that, in order to find the average, the smallest number, or the largest number, the data from all mappers (and hence all combiners) would have to go to a single reducer, so that we can find the universal average, smallest number, or largest number. For larger data sets this would be a huge bottleneck.

I am sure there is some way to handle this issue in Hadoop that I just cannot think of. Can someone please guide me? I have been asked this sort of question in a couple of interviews as well.

Also, while running my 'Find Average' MapReduce program, I am facing an issue: the only running mapper is taking too long to complete. I have increased the map task time-out, but it still gets stuck, whereas the stdout logs show that my mapper and combiner execute smoothly. Hence I am not able to figure out what is causing my MapReduce job to hang.

Amy2477
  • Regarding your last question, it would help if you posted the 'Find Average' code that you are running. Also, note that the average of averages is NOT the global average (i.e., averaging is not associative) – vefthym Jul 23 '15 at 10:01

3 Answers


Averages can be calculated on a stream of data. Try holding on to the following:

  • Current average
  • Number of elements

This way you'll know how much weight to give to an incoming number, as well as to an incoming batch of numbers.
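
For example, here is a minimal plain-Java sketch of that idea (the class and method names are illustrative, not from any particular library). Keeping the current average together with the element count lets you fold in a single number, or a whole batch summarized by its own (average, count):

// Illustrative running average: holds the current average and the
// number of elements it represents.
public class RunningAverage {
    private double average = 0.0;
    private long count = 0;

    // Fold in a single incoming number.
    public void add(double value) {
        count++;
        average += (value - average) / count;
    }

    // Fold in a batch summarized by its own average and count;
    // each side is weighted by the number of elements it represents.
    // (Assumes at least one element overall, so total > 0.)
    public void addBatch(double batchAverage, long batchCount) {
        long total = count + batchCount;
        average = (average * count + batchAverage * batchCount) / total;
        count = total;
    }

    public double getAverage() {
        return average;
    }
}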

srem

For the average, use a single reducer: have all mappers emit the same key for every pair, with the number whose average you want as the value. Do not use a combiner, since averaging is not associative, i.e., the average of averages is not the global average. Example:

values in Mapper 1: 1, 2, 3
values in Mapper 2: 5, 10

The average of the values of Mapper 1 is 2 = (1+2+3)/3.
The average of the values of Mapper 2 is 7.5 = (5+10)/2.
The average of the averages is 4.75 = (2+7.5)/2.
The global average is 4.2 = (1+2+3+5+10)/5.
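
To make the single-reducer approach concrete, here is a minimal Hadoop sketch (the class name and Writable types are my own assumptions): every mapper emits the same key, so one reducer sees every value and computes the true global average as total sum over total count.

import java.io.IOException;
import org.apache.hadoop.io.DoubleWritable;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.mapreduce.Reducer;

// Hypothetical single reducer: all mappers emit the same key, so this
// one reduce() call sees every value in the data set.
public class AverageReducer
        extends Reducer<IntWritable, DoubleWritable, IntWritable, DoubleWritable> {

    @Override
    protected void reduce(IntWritable key, Iterable<DoubleWritable> values,
                          Context context) throws IOException, InterruptedException {
        double sum = 0.0;
        long count = 0;
        for (DoubleWritable v : values) {
            sum += v.get();
            count++;
        }
        // Global average: total sum over total count (not an average of averages).
        context.write(key, new DoubleWritable(sum / count));
    }
}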

For a more detailed answer, including a tricky solution with a combiner, see my slides (starting from slide 7), inspired by Donald Miner's book "MapReduce Design Patterns".

For the min/max, use the following logic:

Again, you can use a single reducer, with all mappers always emitting the same key and, as the value, each of the numbers whose min/max you want to find.

A combiner (which is identical to the reducer) receives a list of values and emits the local min/max. Then the single reducer receives the list of local mins/maxes and emits the global min/max (min and max ARE associative).

In pseudocode:

map(key, value):
    emit(1, value);

reduce(key, list<value>):    // the combiner is the same as the reducer
    min = first value in list;
    for each value in list:
        if value < min:
            min = value;
    emit(key, min);
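
For reference, here is the same logic as a minimal Hadoop sketch (the class name and Writable types are my own assumptions, not code from the answer). Because min is associative, the one class can be registered as both the combiner and the reducer, e.g. with job.setCombinerClass(MinReducer.class) and job.setReducerClass(MinReducer.class):

import java.io.IOException;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.mapreduce.Reducer;

// Hypothetical min reducer; usable as both combiner (local min) and
// reducer (global min) because min is associative.
public class MinReducer
        extends Reducer<IntWritable, LongWritable, IntWritable, LongWritable> {

    @Override
    protected void reduce(IntWritable key, Iterable<LongWritable> values,
                          Context context) throws IOException, InterruptedException {
        long min = Long.MAX_VALUE;
        for (LongWritable v : values) {
            min = Math.min(min, v.get());
        }
        context.write(key, new LongWritable(min));
    }
}
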
vefthym
  • Thanks for your answer. This will surely help me find the average for a small dataset, but it can create issues with huge data, as having just one reducer would cause network overhead; we can't afford to send the output of all the mappers to a single reducer. And if I use multiple reducers, I am not sure how to find the global average. – Amy2477 Jul 27 '15 at 07:08

Logic 1: From the map, output the key as NullWritable and the value as a pair (sum of values, count). In the reducer, split each value into its sum and count, sum the sums and the counts separately, and output the total sum divided by the total count as the average.

Logic 2: Create a custom Writable which can hold the count and the sum, pass this from the map, and reduce it with a single reducer; a sketch follows.
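
Here is a minimal sketch of such a Writable, with illustrative names that are my own assumptions rather than code from the answer: it carries a running sum and count, merges associatively (so it is also safe to pre-aggregate in a combiner), and the reducer divides at the end.

import java.io.DataInput;
import java.io.DataOutput;
import java.io.IOException;
import org.apache.hadoop.io.Writable;

// Hypothetical custom Writable carrying (sum, count) for average computation.
public class SumCountWritable implements Writable {
    private double sum;
    private long count;

    public SumCountWritable() {}  // no-arg constructor required by Hadoop

    public void set(double sum, long count) {
        this.sum = sum;
        this.count = count;
    }

    // Merging two (sum, count) pairs is associative, so this is combiner-safe.
    public void merge(SumCountWritable other) {
        this.sum += other.sum;
        this.count += other.count;
    }

    // The single reducer calls this once, after merging all partial pairs.
    public double average() {
        return sum / count;
    }

    @Override
    public void write(DataOutput out) throws IOException {
        out.writeDouble(sum);
        out.writeLong(count);
    }

    @Override
    public void readFields(DataInput in) throws IOException {
        sum = in.readDouble();
        count = in.readLong();
    }
}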