I'm a beginner with Spark and I'm trying to solve a skewed-data problem. I'm using an algorithm from a colleague to distribute the data based on a key column. The problem is that when I repartition(col("keyColumn")) the DataFrame, Spark merges a few of the partitions and produces bigger output files. I'm guessing this is because of the key.hashCode() % numPartitions scheme.
The data distribution looks like this (a single output file should hold at most 500,000 records):
+-----+------+
| key |count |
+-----+------+
|1    |495941|
|2    |499607|
|3    |498896|
|4    |502845|
|5    |498213|
|6    |501325|
|7    |502355|
|8    |501816|
|9    |498829|
|10   |498272|
|11   |499802|
|12   |501580|
|13   |498779|
|14   |498654|
+-----+------+
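For reference, this distribution comes from a simple group-by count; the column name below is a placeholder for my actual one:

```scala
import org.apache.spark.sql.functions.col

// Count records per key to inspect the skew before repartitioning.
val distribution = df.groupBy(col("keyColumn")).count().orderBy(col("keyColumn"))
distribution.show(14)
```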
But when I look at the files after the repartition, some files contain more than one key, which makes them roughly double the expected size:
+----+------+
|key |count |
+----+------+
|101 |500014|
|115 |504995|
+----+------+
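If my guess about key.hashCode() % num_partitions is right, this is a plain hash collision. A quick sketch of the arithmetic (assuming Int keys, whose hashCode is the value itself, and 14 partitions to match my key count above):

```scala
// Spark's RDD HashPartitioner assigns a key to nonNegativeMod(key.hashCode, numPartitions).
// For Int keys, hashCode == the value itself, so two keys that are 14 apart
// collide when there are 14 partitions:
val numPartitions = 14
Seq(101, 115).foreach { key =>
  val partition = ((key.hashCode % numPartitions) + numPartitions) % numPartitions
  println(s"key=$key -> partition=$partition")
}
// key=101 -> partition=3
// key=115 -> partition=3   <- both keys end up in the same file
```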
I also tried generating random prime keys instead of incremental partition keys, but a few files are still bigger and contain more than one key.
When I use repartition(max_partition_no, col("KeyColumn")), it shuffles the data much more, and I get much bigger files of around 900 MB. The expected file size is between 250 and 350 MB.
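For context, my write is roughly the following; max_partition_no, the column name, the output path, and the Parquet format are placeholders for my real setup:

```scala
import org.apache.spark.sql.functions.col

val maxPartitionNo = 200  // placeholder; the real value comes from my colleague's algorithm
df.repartition(maxPartitionNo, col("KeyColumn"))
  .write
  .mode("overwrite")
  .parquet("/output/path")  // placeholder path
```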
My question is: how do I make sure the repartition puts exactly one key in each output file? And how do I pass a custom Partitioner (one that overrides the default hashCode-based partitioning) to the repartition function?
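To make the second part concrete: the only place I know of that accepts a custom Partitioner is the RDD API, something like the sketch below (one partition per key, assuming integer keys 1..N as in my data). What I can't figure out is how to plug this into the DataFrame's repartition:

```scala
import org.apache.spark.Partitioner

// One partition per key, so each output file holds exactly one key.
// Assumes integer keys in the range 1..numKeys, as in my data above.
class ExactKeyPartitioner(numKeys: Int) extends Partitioner {
  override def numPartitions: Int = numKeys
  override def getPartition(key: Any): Int = key.asInstanceOf[Int] - 1
}

// Works on a pair RDD, but I don't see an equivalent hook for DataFrames:
// df.rdd
//   .map(row => (row.getAs[Int]("keyColumn"), row))
//   .partitionBy(new ExactKeyPartitioner(14))
```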