
I am using Spark 1.6 and I have a DataFrame, which I repartition on a key as shown below.

pairJdbcDF.repartition(pairJdbcDF.select($"Asset").distinct.count.toInt, $"Asset")

My observation is: say I have 6 distinct keys (key1, key2, key3, key4, key5, key6) and I make 6 partitions for these 6 keys. I can see the 6 partitions created by Spark as below.

Partition1: Empty
Partition2: Holds all values for key1
Partition3: Holds all values for key2 and key3
Partition4: Holds all values for key4
Partition5: Holds all values for key5
Partition6: Holds all values for key6

Can someone please explain why Spark keeps 1 partition empty and puts the records for 2 keys into a single partition, as shown above for Partition3? This happens mostly for keys that have very few records.
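As context for the behavior being asked about: `repartition(n, col)` hash-partitions rows, assigning each row to `hash(key) mod n` rather than mapping one key to one partition, so two keys can collide in the same bucket and leave another bucket empty. Below is a minimal standalone sketch of that effect; it uses `String.hashCode` and a non-negative modulo as a stand-in for Spark's actual internal hash, so the exact key-to-partition assignment will differ from what Spark produces.

```scala
// Sketch: why hash partitioning can leave a partition empty.
// Each key goes to partition = nonNegativeMod(hash(key), numPartitions);
// nothing guarantees the 6 hashes cover all 6 partitions, so
// collisions (two keys in one partition) and empty partitions occur.
// String.hashCode is a stand-in for Spark's internal hash function.
object HashPartitionDemo {
  def nonNegativeMod(x: Int, mod: Int): Int = {
    val raw = x % mod
    if (raw < 0) raw + mod else raw
  }

  def main(args: Array[String]): Unit = {
    val keys = Seq("key1", "key2", "key3", "key4", "key5", "key6")
    val numPartitions = keys.size
    // Group the keys by their computed partition id and print the layout.
    val layout = keys.groupBy(k => nonNegativeMod(k.hashCode, numPartitions))
    (0 until numPartitions).foreach { p =>
      val contents = layout.getOrElse(p, Seq.empty)
      if (contents.isEmpty) println(s"Partition$p: Empty")
      else println(s"Partition$p: ${contents.mkString(", ")}")
    }
  }
}
```

With few records per key this skew is simply more visible, since a colliding partition holds all rows for both of its keys while the empty one holds none.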

nilesh1212
