
I run in local mode and initialize with 2 partitions. When I use DataFrame.show(), the log looks like this: `INFO scheduler.TaskSetManager: Finished task 1.0 in stage 3.0 (TID 5) in 390 ms on localhost (2/2)`. But when I use DataFrame.groupBy(), I get a large number of tasks, like this: `INFO scheduler.TaskSetManager: Finished task 83.0 in stage 15.0 (TID 691) in 644 ms on localhost (84/200)`. My source code is here:

everyIResDF.show()
val resDF = everyIResDF
  .groupBy("dz_id","dev_id","dev_type","time")
  .avg("IRes")
resDF.show()

I want to know why groupBy() causes this and how to solve it. Any help is appreciated.

ulysses
  • What is your problem? That the number of tasks increases a lot when using `groupBy()`? It's an expensive operation that requires a complete reshuffle of your data. – Shaido Aug 23 '17 at 05:55
  • I see it's not a good way to use groupBy(). I just want to know the rule for how tasks are created and why there are so many of them. – ulysses Aug 24 '17 at 00:48

1 Answer


A task is launched for each partition you have in each stage. You initialized the dataframe with 2 partitions, hence the number of tasks is low (2) in your first INFO print.

However, each time Spark needs to perform a shuffle of the data, it decides how many partitions the shuffled RDD will have. The default value is 200. Therefore, after groupBy(), which requires a full shuffle of the data, the number of tasks increases to 200 (as seen in your second INFO print).
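A quick way to see this is to check the partition count before and after the shuffle. This is just a sketch reusing the DataFrame names from your code; `rdd.getNumPartitions` assumes Spark 1.6 or later:

// Inspect the partition count before and after the shuffle.
println(everyIResDF.rdd.getNumPartitions)   // 2, the partitions you initialized

val resDF = everyIResDF
  .groupBy("dz_id", "dev_id", "dev_type", "time")
  .avg("IRes")

println(resDF.rdd.getNumPartitions)         // 200, the default shuffle partition count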

The number of partitions used when shuffling data can be set through Spark's configuration. For example, to set it to 4, simply do:

sqlContext.setConf("spark.sql.shuffle.partitions", "4")

By running the code with this configuration you will no longer see such a high number of tasks. The optimal number of partitions depends on many things, but a common heuristic is 3 or 4 times your number of cores.
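As an illustration of that heuristic, here is a rough sketch; the multiplier of 3 and the use of `sqlContext` are only assumptions for the example:

// Derive the shuffle-partition count from the cores of the local JVM,
// which matches a local-mode setup like yours.
val cores = Runtime.getRuntime.availableProcessors()
val shufflePartitions = cores * 3   // heuristic: 3-4x the core count
sqlContext.setConf("spark.sql.shuffle.partitions", shufflePartitions.toString)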

Shaido
  • It helps me a lot! If my application shuffles data in many places, can I change `shuffle.partitions` while the application is running? And for spark-submit, is the conf key `spark.shuffle.sort.bypassMergeThreshold`? – ulysses Aug 24 '17 at 02:34
  • @ulysses I believe the easiest way would be to set the number of partitions on the dataframe itself by doing `df.repartition(numOfPartitions)` (see the sketch after these comments). For spark-submit the key is still `spark.sql.shuffle.partitions`, although you may want to set `spark.default.parallelism` as well, see https://stackoverflow.com/questions/45704156/spark-sql-shuffle-partitions-and-spark-default-parallelism/45704560#45704560. – Shaido Aug 24 '17 at 02:47
  • Got it. Where do these confs come from? I only see some confs at `http://spark.apache.org/docs/latest/configuration.html` – ulysses Aug 24 '17 at 04:02
  • The SQL confs can be found here: https://spark.apache.org/docs/latest/sql-programming-guide.html. The other, RDD-level one is on the page you have. – Shaido Aug 24 '17 at 04:22
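As a follow-up to the `df.repartition(numOfPartitions)` suggestion in the comments above, here is a minimal sketch; the value 8 is only a placeholder, and on spark-submit the same global setting can instead be passed with `--conf spark.sql.shuffle.partitions=...`:

// Control the partitioning of one specific DataFrame instead of
// changing the global shuffle setting.
val numOfPartitions = 8
val repartitionedDF = resDF.repartition(numOfPartitions)
println(repartitionedDF.rdd.getNumPartitions)   // 8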