
I have read the Spark documentation and I would like to be sure I am doing the right thing: https://spark.apache.org/docs/latest/tuning.html#memory-usage-of-reduce-tasks

Spark’s shuffle operations (sortByKey, groupByKey, reduceByKey, join, etc) build a hash table within each task to perform the grouping, which can often be large.
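
If I read that right, the simplest mitigation is to raise the shuffle parallelism so each task's hash table stays small. A minimal sketch of what I mean in a spark-shell session (the input path and the partition count 400 are made up):

    // spark-shell sketch; the path and the count 400 are illustrative only.
    val pairs = sc.textFile("hdfs:///data/input")
      .map(line => (line.split(",")(0), 1L))

    // Passing an explicit partition count to reduceByKey raises the level of
    // parallelism, so each task builds a smaller in-memory hash table.
    val counts = pairs.reduceByKey(_ + _, 400)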

How does this interact with the input file split size? My understanding is that a large number of tasks would create a lot of small output files. Should I repartition the data to a smaller number of partitions after a shuffle operation?
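
Continuing the sketch above, this is what I have in mind (16 is an arbitrary target file count): keep the shuffle itself wide, then narrow the partition count just before writing.

    // coalesce narrows the partition count for the write without another
    // full shuffle, so the job produces fewer, larger output files.
    val reduced = pairs.reduceByKey(_ + _, 400)
    reduced.coalesce(16).saveAsTextFile("hdfs:///data/output")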

David H
  • The advice I've heard is that you want your output files to be about the size of an HDFS block (but never over), so repartitioning to get that could be a good idea. – puhlen Jan 31 '17 at 20:29
  • When you repartition data from input files, a good option is to use a HashPartitioner; a detailed explanation is at http://stackoverflow.com/questions/31424396/how-does-hashpartitioner-work (a sketch combining this with the block-size advice follows these comments). – sparkDabbler Feb 01 '17 at 03:56
  • It seems like I am using BroadcastNestedLoopJoin, and the application gets stuck when I use this kind of join; the broadcast is about 32 MB in size. – David H Feb 03 '17 at 20:07
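
Combining the two comments above into one hedged sketch (all sizes are illustrative guesses, and pairs is the key/value RDD from the question): estimate a partition count from the expected output size and the HDFS block size, then repartition with a HashPartitioner.

    import org.apache.spark.HashPartitioner

    // Illustrative guesses, not measurements.
    val estimatedOutputBytes = 16L * 1024 * 1024 * 1024   // assume ~16 GB of output
    val hdfsBlockBytes       = 128L * 1024 * 1024         // 128 MB default HDFS block

    // Aim for output files of roughly one HDFS block each (never over),
    // rounding up so no file exceeds the block size.
    val numParts = math.max(1, math.ceil(estimatedOutputBytes.toDouble / hdfsBlockBytes).toInt)

    // HashPartitioner routes each key by key.hashCode modulo numParts,
    // spreading keys evenly when the hash is well distributed.
    val partitioned = pairs.partitionBy(new HashPartitioner(numParts))
    partitioned.saveAsTextFile("hdfs:///data/output-by-block")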

0 Answers