I am writing simple MapReduce programs to find the average, smallest number, and largest number in my data (many text files). I figure that using a combiner to first aggregate the numbers processed by each individual mapper would make this more efficient.
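To make it concrete, here is a minimal sketch of the kind of thing I have in mind (class names and the one-number-per-line input format are just illustrative, not my actual code). One subtlety: the combiner can't emit partial averages, since averaging averages gives the wrong answer when group sizes differ, so mapper and combiner both emit mergeable "sum,count" pairs instead:

```java
import java.io.IOException;
import org.apache.hadoop.io.DoubleWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;

public class AverageJob {

  public static class AvgMapper extends Mapper<LongWritable, Text, Text, Text> {
    private static final Text KEY = new Text("avg"); // single key: everything meets at one reducer
    @Override
    protected void map(LongWritable offset, Text line, Context ctx)
        throws IOException, InterruptedException {
      String s = line.toString().trim();
      if (s.isEmpty()) return;
      double value = Double.parseDouble(s);
      ctx.write(KEY, new Text(value + ",1")); // partial sum, partial count
    }
  }

  // Combiner: merging "sum,count" pairs is associative, so it is safe
  // whether Hadoop runs it zero, one, or several times.
  public static class SumCountCombiner extends Reducer<Text, Text, Text, Text> {
    @Override
    protected void reduce(Text key, Iterable<Text> values, Context ctx)
        throws IOException, InterruptedException {
      double sum = 0; long count = 0;
      for (Text v : values) {
        String[] parts = v.toString().split(",");
        sum += Double.parseDouble(parts[0]);
        count += Long.parseLong(parts[1]);
      }
      ctx.write(key, new Text(sum + "," + count));
    }
  }

  public static class AvgReducer extends Reducer<Text, Text, Text, DoubleWritable> {
    @Override
    protected void reduce(Text key, Iterable<Text> values, Context ctx)
        throws IOException, InterruptedException {
      double sum = 0; long count = 0;
      for (Text v : values) {
        String[] parts = v.toString().split(",");
        sum += Double.parseDouble(parts[0]);
        count += Long.parseLong(parts[1]);
      }
      ctx.write(key, new DoubleWritable(sum / count)); // global average
    }
  }
}
```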
However, I am concerned that computing the global average, smallest number, or largest number requires the output of all mappers (and hence all combiners) to go to a single reducer. For large data sets, that single reducer would be a huge bottleneck.
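For reference, this is roughly how I would wire up the job using the classes from the sketch above (paths and the job name are placeholders); the setNumReduceTasks(1) call is exactly the funnel I am worried about:

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.DoubleWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class AverageDriver {
  public static void main(String[] args) throws Exception {
    Job job = Job.getInstance(new Configuration(), "find-average");
    job.setJarByClass(AverageDriver.class);
    job.setMapperClass(AverageJob.AvgMapper.class);
    job.setCombinerClass(AverageJob.SumCountCombiner.class); // pre-aggregates per mapper
    job.setReducerClass(AverageJob.AvgReducer.class);
    job.setNumReduceTasks(1);                // every partial (sum,count) meets here
    job.setMapOutputKeyClass(Text.class);
    job.setMapOutputValueClass(Text.class);
    job.setOutputKeyClass(Text.class);
    job.setOutputValueClass(DoubleWritable.class);
    FileInputFormat.addInputPath(job, new Path(args[0]));
    FileOutputFormat.setOutputPath(job, new Path(args[1]));
    System.exit(job.waitForCompletion(true) ? 0 : 1);
  }
}
```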
I am sure there is some way to handle this in Hadoop that I am just not seeing. Can someone please guide me? I have been asked this sort of question in a couple of interviews as well.
Also, while running my 'Find Average' MapReduce program, I am facing an issue: the only running mapper is taking too long to complete. I have increased the map task timeout, but it still gets stuck. Yet the stdout logs show that my mapper and combiner both execute smoothly, so I cannot figure out what is causing the job to hang.