How does Spark partition the data after applying a groupBy transformation in Spark SQL? Is the number of partitions equal to spark.sql.shuffle.partitions,
or to the number of groups that the data can form? Below is sample code and data.
val studentDF = createStudentDF() // student: <Id, Name, Department>
studentDF.registerTempTable("students")
val x = sqlContext.sql("select depid, count(*) from students group by depid")
x.rdd.count
Since spark.sql.shuffle.partitions defaults to 200, I was expecting 200 partitions, but there were only 3 (one per department) when I ran it on the data below.
ID  NAME  DEPID
1   ABC   1
2   DEF   1
3   GHI   2
4   JKL   2
5   MNO   3
Does Spark eliminate the empty partitions after the operation completes, or am I missing something obvious here?
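For intuition about why only a few of the 200 shuffle partitions would hold data, here is a toy sketch of hash partitioning in plain Python (this uses Python's built-in hash, not Spark's actual partitioner, and is only an illustration of the idea): each distinct group key maps to exactly one of the N buckets, so 3 distinct keys can occupy at most 3 of the 200 partitions, and the rest stay empty.

```python
# Toy illustration of hash partitioning: each distinct group key lands
# in exactly one of num_partitions buckets, so at most as many buckets
# as there are distinct keys can be non-empty.
def bucket_of(key, num_partitions):
    return hash(key) % num_partitions

num_partitions = 200          # analogous to spark.sql.shuffle.partitions
dep_ids = [1, 1, 2, 2, 3]     # DEPID column from the sample data

non_empty = {bucket_of(k, num_partitions) for k in dep_ids}
print(len(non_empty))         # 3 buckets hold data; the other 197 are empty
```

Whether the empty partitions are physically eliminated or merely empty is the question above; this sketch only shows why no more than 3 of them can contain rows.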