
What is spark.sql.shuffle.partitions in a more technical sense? I have seen answers like the one here, which says: "configures the number of partitions that are used when shuffling data for joins or aggregations."

What does that actually mean? How does shuffling data from node to node actually change when this number is higher or lower?

Thanks!

Tanner Clark

2 Answers


Partitions define where data resides in your cluster. A single partition can contain many rows, but all of them will be processed together in a single task on one node.

Looking at edge cases: if we repartition our data into a single partition, then even if we have 100 executors, it will be processed by only one of them. [Diagram: a single partition processed by one executor]

On the other hand, if you have a single executor but multiple partitions, they will all (obviously) be processed on the same machine.
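To make this concrete, here is a minimal PySpark sketch (local mode, toy data; the numbers are illustrative, not from the poster's setup). Each partition reported below becomes one task when an action runs, and repartitioning down to a single partition serializes all the work into one task no matter how many executors exist:

```python
# Minimal sketch in local mode: each partition becomes one task.
from pyspark.sql import SparkSession

spark = SparkSession.builder.master("local[4]").getOrCreate()

df = spark.range(1_000_000)            # toy data
print(df.rdd.getNumPartitions())       # 4 in local[4] -- one per task slot

single = df.repartition(1)             # force everything into one partition...
print(single.rdd.getNumPartitions())   # 1 -- so one task, on one node
```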

Shuffles happen when one executor needs data from another - the basic example is a groupBy aggregation, since all rows with the same key are needed together to calculate the result. Irrespective of how many partitions we had before the groupBy, after it Spark will split the results into spark.sql.shuffle.partitions partitions.
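A hedged sketch of that behaviour (adaptive query execution is disabled here as an assumption, since in newer Spark versions AQE may coalesce the shuffle output to fewer partitions):

```python
# The partition count after a groupBy equals spark.sql.shuffle.partitions,
# regardless of how the input was partitioned (AQE off for determinism).
from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = (SparkSession.builder
         .master("local[4]")
         .config("spark.sql.adaptive.enabled", "false")
         .config("spark.sql.shuffle.partitions", "8")
         .getOrCreate())

df = spark.range(1_000_000).repartition(2)   # 2 partitions before the shuffle
grouped = df.groupBy((F.col("id") % 100).alias("bucket")).count()

print(df.rdd.getNumPartitions())        # 2
print(grouped.rdd.getNumPartitions())   # 8 -- the configured value
```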

Quoting from "Spark: The Definitive Guide" by Bill Chambers and Matei Zaharia:

A good rule of thumb is that the number of partitions should be larger than the number of executors on your cluster, potentially by multiple factors depending on the workload. If you are running code on your local machine, it would behoove you to set this value lower because your local machine is unlikely to be able to execute that number of tasks in parallel.

So, to sum up: if you set this number lower than your cluster's capacity to run tasks in parallel, you won't be able to use all of its resources. On the other hand, since each task processes a single partition, having thousands of tiny partitions would (I expect) add some scheduling and I/O overhead.
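As a sketch of that rule of thumb (the 3x multiplier is just an assumption standing in for the book's "multiple factors", not a fixed recommendation):

```python
# Derive a shuffle-partition setting from the parallelism Spark reports.
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

slots = spark.sparkContext.defaultParallelism   # total task slots Spark sees
spark.conf.set("spark.sql.shuffle.partitions", str(slots * 3))
print(spark.conf.get("spark.sql.shuffle.partitions"))
```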

Daniel
  • It's a nice explanation. It also takes effect during join operations. – vikrant rana Sep 05 '19 at 16:46
  • So a shuffle shuffles the entire partition to a different executor? If this is true, what happens when you partition by one key and then group by a different key? (Not that you should do this, as I've picked up!) – Tanner Clark Sep 06 '19 at 11:12
  • 2
    Not exactly. Tasks save data into 'shuffle files' using different algorithms and later there is a whole another layer of compressing and sending this data between nodes. So results of processing a single partition can result in multiple partitions down the line. If you parition by one key and then group by another workers will exchange necessary rows but it will be very heavy operation. What's important is that spark will try to always optimize the execution plan - e.g. if you run join against a very small dataframe, it can be broadcasted to all of the executors and kept in-memory. – Daniel Sep 06 '19 at 12:57
  • @Daniel Hi, if I do not set the spark.sql.shuffle.partitions parameter, then how does Spark decide the number of partitions the RDD should be divided into after a shuffle operation? For example, in my case I divided my data into 2 partitions and had 2 tasks, but after the group by I can see 203 tasks. – akash patel Oct 07 '20 at 14:15
  • The default value of spark.sql.shuffle.partitions is 200, so if you do not change it, that's the number of partitions you get after any shuffle operation. – Daniel Oct 07 '20 at 14:42

spark.sql.shuffle.partitions is the parameter which determines how many partitions (blocks) your data is split into when it is shuffled.

Say you have 40 GB of data and spark.sql.shuffle.partitions set to 400; your data will then be shuffled in blocks of roughly 40 GB / 400 = 100 MB each (assuming the data is evenly distributed).

By changing spark.sql.shuffle.partitions you change both the size of the blocks being shuffled and the number of blocks in each shuffle stage.
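A back-of-the-envelope illustration of that relationship in plain Python (the 40 GB figure is the example above, and even distribution is assumed):

```python
# Doubling the shuffle partition count halves the per-block size.
data_size_gb = 40
for shuffle_partitions in (100, 200, 400, 800):
    block_mb = data_size_gb * 1024 / shuffle_partitions
    print(f"{shuffle_partitions:4d} partitions -> ~{block_mb:.0f} MB per block")
```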

As Daniel says, a rule of thumb is to never set spark.sql.shuffle.partitions lower than the number of cores available to a job.

Mark
  • Hi, if I do not set the spark.sql.shuffle.partitions parameter, then how does Spark decide the number of partitions the RDD should be divided into after a shuffle operation? For example, in my case I divided my data into 2 partitions and had 2 tasks, but after the group by I can see 203 tasks. – akash patel Oct 07 '20 at 14:15