
How does Spark partition the data after applying a groupBy transformation in Spark SQL? Is the number of partitions equal to spark.sql.shuffle.partitions, or to the number of groups the data can produce? Below are the sample code and data.

var studentDF = createStudentDF()       // student: <Id, Name, Department>
studentDF.registerTempTable("students")
val x = sqlContext.sql("select depid, count(*) from students group by depid")
x.rdd.count                             // returns 3 on the data below

Since spark.sql.shuffle.partitions defaults to 200, I was expecting 200 partitions, but I got only 3 (one per department) when running this on the data below.

ID NAME DEPID
1  ABC  1  
2  DEF  1  
3  GHI  2  
4  JKL  2  
5  MNO  3

Is it that Spark eliminates the empty partitions after the operation completes, or am I missing something obvious here?

Madhusoodan P
  • @user6910411 please take a look at the complete question – Madhusoodan P Oct 13 '18 at 18:54
  • OK, let's be more explicit :) `x.rdd.count` doesn't check the number of partitions. Please check the linked thread to learn how to properly check the number of partitions. – zero323 Oct 13 '18 at 23:11
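For reference, a minimal sketch of the check the comment above points at (assuming Spark 1.6+, where RDD.getNumPartitions is available; on older releases x.rdd.partitions.length gives the same information). x.rdd.count returns the number of result rows, while the partition count has to be read from the RDD's metadata:

// Sketch, assuming the same sqlContext and "students" temp table as in the question.
val x = sqlContext.sql("select depid, count(*) from students group by depid")

val numGroups = x.rdd.count()              // number of result rows: one per distinct depid (3 here)
val numPartitions = x.rdd.getNumPartitions // number of shuffle partitions backing the result,
                                           // typically spark.sql.shuffle.partitions (200), empty ones included

println(s"groups = $numGroups, partitions = $numPartitions")

Note that count() triggers the full aggregation job, whereas getNumPartitions only inspects the RDD's partition metadata.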

0 Answers