I have read the Spark documentation and I would like to be sure I am doing the right thing. https://spark.apache.org/docs/latest/tuning.html#memory-usage-of-reduce-tasks
The tuning guide says: "Spark's shuffle operations (sortByKey, groupByKey, reduceByKey, join, etc) build a hash table within each task to perform the grouping, which can often be large."
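Its suggested fix is to increase the level of parallelism so that each task's input set is smaller. What I am doing right now looks roughly like this (a spark-shell sketch; the path, the key/value parsing, and the partition count 2000 are just placeholders for my real job):

```scala
// spark-shell sketch; `sc` is the SparkContext provided by the shell.
// Hypothetical input where each line is "key,value".
val pairs = sc.textFile("hdfs:///data/events/*.csv")
  .map(_.split(",", 2))
  .map(a => (a(0), a(1).toLong))

// Passing an explicit numPartitions to the shuffle operation spreads the
// grouping over more tasks, so each task's in-memory hash table stays small.
val summed = pairs.reduceByKey(_ + _, 2000)
```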
How does this interact with the input file split size? My understanding is that a large number of tasks would produce a lot of small output files. Should I repartition the data to a smaller number of partitions after a shuffle operation? Concretely, I am considering something like the sketch below.
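Continuing the snippet above: keep the shuffle wide for memory headroom, then shrink the partition count only before the write so the output is not split into 2000 tiny files (the count 64 is illustrative):

```scala
// coalesce() is a narrow dependency, so it does not trigger a second shuffle;
// it only merges the existing shuffle output partitions before writing.
val compacted = summed.coalesce(64)
compacted.saveAsTextFile("hdfs:///data/output/summed")
```

Is this the right pattern, or does shrinking the partition count like this undo the memory benefit the tuning guide describes?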