I'm a beginner with Spark and I'm trying to solve a skewed-data problem. I'm using an algorithm from a colleague to distribute the data based on a key column. The problem is that when I repartition(col("keyColumn")) the DataFrame, Spark merges a few of the partitions and produces bigger output files. I'm guessing this is because of the key.hashCode() % numPartitions scheme.
The data distribution looks like this (a single output file should hold at most 500,000 records):
+-----+------+
| key |count |
+-----+------+
|1    |495941|
|2    |499607|
|3    |498896|
|4    |502845|
|5    |498213|
|6    |501325|
|7    |502355|
|8    |501816|
|9    |498829|
|10   |498272|
|11   |499802|
|12   |501580|
|13   |498779|
|14   |498654|
+-----+------+
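For reference, this distribution comes from a simple group-by count; the column name below is a placeholder for my actual one:

```scala
import org.apache.spark.sql.functions.col

// Count records per key to inspect the skew before repartitioning.
val distribution = df.groupBy(col("keyColumn")).count().orderBy(col("keyColumn"))
distribution.show(14)
```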
But when I look at the files after the repartition, some files contain more than one key, which makes them roughly double the expected size:
+----+------+
|key |count |
+----+------+
|101 |500014|
|115 |504995|
+----+------+
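If my guess about key.hashCode() % num_partitions is right, this is a plain hash collision. A quick sketch of the arithmetic (assuming Int keys, whose hashCode is the value itself, and 14 partitions to match my key count above):

```scala
// Spark's RDD HashPartitioner assigns a key to nonNegativeMod(key.hashCode, numPartitions).
// For Int keys, hashCode == the value itself, so two keys that are 14 apart
// collide when there are 14 partitions:
val numPartitions = 14
Seq(101, 115).foreach { key =>
  val partition = ((key.hashCode % numPartitions) + numPartitions) % numPartitions
  println(s"key=$key -> partition=$partition")
}
// key=101 -> partition=3
// key=115 -> partition=3   <- both keys end up in the same file
```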
I also tried generating random prime keys instead of incremental partition keys, but a few files are still bigger and contain more than one key.
When I use repartition(max_partition_no, col("KeyColumn")), it shuffles the data much more, and I get much bigger files of around 900 MB. The expected file size is between 250 and 350 MB.
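For context, my write is roughly the following; max_partition_no, the column name, the output path, and the Parquet format are placeholders for my real setup:

```scala
import org.apache.spark.sql.functions.col

val maxPartitionNo = 200  // placeholder; the real value comes from my colleague's algorithm
df.repartition(maxPartitionNo, col("KeyColumn"))
  .write
  .mode("overwrite")
  .parquet("/output/path")  // placeholder path
```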
My question is: how do I make sure the repartition puts exactly one key in each output file? And how do I pass a custom Partitioner (one that overrides the default hashCode-based partitioning) to the repartition function?
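To make the second part concrete: the only place I know of that accepts a custom Partitioner is the RDD API, something like the sketch below (one partition per key, assuming integer keys 1..N as in my data). What I can't figure out is how to plug this into the DataFrame's repartition:

```scala
import org.apache.spark.Partitioner

// One partition per key, so each output file holds exactly one key.
// Assumes integer keys in the range 1..numKeys, as in my data above.
class ExactKeyPartitioner(numKeys: Int) extends Partitioner {
  override def numPartitions: Int = numKeys
  override def getPartition(key: Any): Int = key.asInstanceOf[Int] - 1
}

// Works on a pair RDD, but I don't see an equivalent hook for DataFrames:
// df.rdd
//   .map(row => (row.getAs[Int]("keyColumn"), row))
//   .partitionBy(new ExactKeyPartitioner(14))
```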