
While learning Hadoop, I found the number of reducers very confusing:

1) The number of reducers is the same as the number of partitions.

2) The number of reducers is 0.95 or 1.75 multiplied by (no. of nodes) * (no. of maximum containers per node).

3) The number of reducers is set by mapred.reduce.tasks.

4) The number of reducers is chosen so that the output is a multiple of the block size, each task runs for 5 to 15 minutes, and the fewest files possible are created.

I am very confused. Do we explicitly set the number of reducers, or is it done by the MapReduce program itself?

How is the number of reducers calculated, and how can I calculate it myself?

Mohit Jain

4 Answers


1 - The number of reducers is the same as the number of partitions - false. A single reducer might work on one or more partitions, but a chosen partition will be processed entirely on the reducer on which it is started.

2 - That is just a theoretical maximum number of reducers you can configure for a Hadoop cluster. It also depends heavily on the kind of data you are processing (which decides how much heavy lifting the reducers are burdened with). A small worked sketch of this guideline appears at the end of this answer.

3 - The mapred-site.xml configuration is just a suggestion to YARN. Internally, the ResourceManager has its own algorithm running, optimizing things on the go, so that value is not really the number of reducer tasks running every time.

4 - This one seems a bit unrealistic. My block size might be 128 MB, and I can't always have 128*5 as the minimum number of reducers. That again is false, I believe.

There is no fixed number of reducer tasks that can be configured or calculated. It depends on how many resources are actually available to allocate at that moment.
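
To make point 2 concrete, here is a small worked sketch of the 0.95 / 1.75 guideline. The cluster numbers (10 nodes, 8 containers per node) are made up purely for illustration; the usual reading of the two factors is that 0.95 lets every reducer launch in a single wave, while 1.75 lets faster nodes pick up a second wave of reducers.

```java
// Worked example of the 0.95 / 1.75 guideline with hypothetical cluster numbers.
public class ReducerGuideline {
    public static void main(String[] args) {
        int nodes = 10;                              // hypothetical number of nodes
        int containersPerNode = 8;                   // hypothetical max containers per node
        int slots = nodes * containersPerNode;       // 80 reduce slots in total

        long singleWave = Math.round(0.95 * slots);  // 76: all reducers can start at once
        long doubleWave = Math.round(1.75 * slots);  // 140: faster nodes run a second wave

        System.out.println(singleWave + " to " + doubleWave + " reducers");
    }
}
```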

ViKiG
  • Thanks for the reply, I got your points 1, 2 and 3. But I think if we set mapred.reduce.tasks, then it will be the number of reducers. Correct me if I am wrong. So I think it works like this: we can set the number of reducers using mapred.reduce.tasks or the setNumReduceTasks() method, and the partitioner divides the data among the reducer tasks. Please clarify. – Mohit Jain Jul 05 '16 at 03:59
  • Yes, most of the time the `setNumReduceTasks()` method call in the driver class works. Sometimes I have seen that when I set the number of reducers to 6 while only 2 are required, the ApplicationMaster simply runs 4 extra empty reducers that do nothing. And if you set the number of reducers lower than required, it might honor your setting, but that would not be an optimized setting for running MapReduce. Usually what I do is calculate the input data size before running MapReduce and set the approximate number of reducers it might need. – ViKiG Jul 05 '16 at 06:04
  • @ViKiG Regarding point 3: if Hadoop uses its own algorithm to calculate the optimal number of reducers, why do I need to provide the number of reducers? – Bemipefe Jan 01 '19 at 15:54
  • @Bemipefe If the number of reducers given in `mapred-site.xml` is 6 and only 2 are actually possible or needed, then it will create only 2, not 6. If more than 6 reducers are required, there will probably still be only 6, even though having more than 6 would be good. Most of this I understood through experimentation and by comparing with the documentation. – ViKiG Jan 02 '19 at 06:27

The number of reducers is internally calculated from the size of the data being processed if you don't explicitly specify it using the API below in the driver program:

job.setNumReduceTasks(x)

By default, one reducer is used per 1 GB of data.

So if you are working with less than 1 GB of data and you do not specifically set the number of reducers, 1 reducer will be used.

Similarly, if your data is 10 GB, 10 reducers will be used.

You can also change this configuration so that a bigger or smaller size than 1 GB is used per reducer.

The property in Hive for setting the size per reducer is:

hive.exec.reducers.bytes.per.reducer

You can view this property by firing the set command in the Hive CLI.

The partitioner only decides which data goes to which reducer.
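
To make the estimate above concrete, here is a minimal driver sketch that applies the same rule by hand: it measures the total input size on HDFS and divides it by a bytes-per-reducer target before submitting the job. The input path argument, the 1 GB target, and the class name are illustrative assumptions, not values taken from this answer.

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.mapreduce.Job;

public class ReducerEstimateDriver {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        Path input = new Path(args[0]);                        // input directory on HDFS

        // Total bytes under the input path.
        long inputBytes = FileSystem.get(conf).getContentSummary(input).getLength();

        // Illustrative target: one reducer per 1 GB of input, as described above.
        long bytesPerReducer = 1024L * 1024L * 1024L;
        int reducers = (int) Math.max(1L, (inputBytes + bytesPerReducer - 1) / bytesPerReducer);

        Job job = Job.getInstance(conf, "reducer-estimate");   // hypothetical job name
        job.setNumReduceTasks(reducers);                       // same API as shown above
        // ... set mapper, reducer, key/value classes and input/output paths here,
        // then submit with job.waitForCompletion(true).
    }
}
```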

user3484461

Your job may or may not need reducers; it depends on what you are trying to do. When there are multiple reducers, the map tasks partition their output, each creating one partition for each reduce task. There can be many keys (and their associated values) in each partition, but the records for any given key are all in a single partition. One rule of thumb is to aim for reducers that each run for five minutes or so and produce at least one HDFS block's worth of output. With too many reducers you end up with lots of small files.
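
To make that rule of thumb concrete, here is a small sketch that picks a reducer count so that each reducer writes at least one full HDFS block of output. The 20 GB output estimate and the 128 MB block size are illustrative assumptions, not values from this answer.

```java
// Rule-of-thumb sizing: at most one reducer per block of expected reduce output,
// so no reducer writes less than a full HDFS block. Numbers are illustrative.
public class RuleOfThumb {
    public static void main(String[] args) {
        long expectedOutputBytes = 20L * 1024 * 1024 * 1024; // rough estimate of total reduce output
        long blockSizeBytes = 128L * 1024 * 1024;            // HDFS block size on this cluster

        long reducers = Math.max(1L, expectedOutputBytes / blockSizeBytes);
        System.out.println(reducers);                        // 160 for these numbers
    }
}
```

If each of those reducers would finish in well under five minutes, you would lower the count further, trading fewer, larger output files for longer-running reduce tasks.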

red

The partitioner makes sure that the same keys from multiple mappers go to the same reducer. This doesn't mean that the number of partitions is equal to the number of reducers. However, you can specify the number of reduce tasks in the driver program using the job instance, like job.setNumReduceTasks(2). If you don't specify the number of reduce tasks in the driver program, then it is picked from mapred.reduce.tasks, which has a default value of 1 (https://hadoop.apache.org/docs/r1.0.4/mapred-default.html), i.e. all mapper output will go to a single reducer.

Also, note that the programmer does not have direct control over the number of mappers, as it depends on the input splits, whereas the programmer can control the number of reducers for any job.
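
To illustrate how the key-to-reducer assignment relates to the reducer count, here is a minimal sketch of the logic used by Hadoop's default HashPartitioner; the example keys are made up for illustration.

```java
// The default HashPartitioner assigns a key to a reducer using the key's hash code
// modulo the number of reduce tasks, so every record with the same key lands on the
// same reducer.
public class PartitionDemo {
    static int partitionFor(String key, int numReduceTasks) {
        // Same formula as org.apache.hadoop.mapreduce.lib.partition.HashPartitioner.
        return (key.hashCode() & Integer.MAX_VALUE) % numReduceTasks;
    }

    public static void main(String[] args) {
        int numReduceTasks = 2; // as set via job.setNumReduceTasks(2) in the driver
        for (String key : new String[] {"apple", "banana", "cherry"}) {
            System.out.println(key + " -> reducer " + partitionFor(key, numReduceTasks));
        }
    }
}
```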

gunner87
  • Thanks for the comment. If there are three partitions and we set the number of reduce tasks to 2, then how will the data be divided? Will data from 2 partitions go to one reducer and data from the remaining partition go to the other reducer? Also, we can set the input split size, so we can set the number of mappers. – Mohit Jain Jul 05 '16 at 03:49
  • If there are 3 partitions, then the data is already divided and the master will assign the reducers to the 3 partitions. The master receives heartbeat messages from the data nodes, which contain information about their availability, resources, etc., and it uses this information while scheduling. The reducer that gets 2 partitions will process one partition after the other. More information about the number of reducers and mappers can be found at this link: http://stackoverflow.com/questions/6885441/setting-the-number-of-map-tasks-and-reduce-tasks – gunner87 Jul 05 '16 at 05:19
  • @gunner87 I believe that if mapred.reduce.tasks is not provided, ***the default is 1 only if all the partitions can fit on a single node***. What if the produced partition size exceeds the HDFS free space on a single node? – Bemipefe Jan 01 '19 at 16:02