
I have a dataframe with schema as follows:

root
 |-- category_id: string (nullable = true)
 |-- article_title: string (nullable = true)

And data that looks like this:

+-----------+--------------------+
|category_id|     articletitle   |
+-----------+--------------------+
|       1000|HP EliteOne 800 G...|
|       1000|ASUS  EB1501P ATM...|
|       1000|HP EliteOne 800 G...|
|          1|ASUS R557LA-XO119...|
|          1|HP EliteOne 800 G...|
+-----------+--------------------+

There are just two distinct category_id values, 1000 and 1.

I want to repartition by category_id and run mapPartitions on each of the resulting partitions.

p_df = df.repartition(2, "category_id")
p_df.rdd.mapPartitionsWithIndex(some_func)
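Here some_func is any function of the form f(index, iterator) -> iterator; a minimal hypothetical stand-in that just counts the rows in each partition would be:

```python
def some_func(index, iterator):
    # mapPartitionsWithIndex calls this once per partition; this toy
    # version materializes the partition and yields one summary pair.
    rows = list(iterator)
    yield (index, len(rows))
```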

But the data is not getting partitioned correctly. The expected result is that each partition holds records for only one category_id; the actual result is that one partition gets 0 records while the other gets all the records.

Why is this happening and how to fix this?

There is already a question on how the Spark partitioner works. My question is different: the answers there only explain how the partitioner works, whereas I am asking why this happens (which is answered already) and how to fix it.

  • How did you arrive to the conclusion that one partition is empty and the other one has all the records? Can you add the output of `p_df.withColumn("partition" , spark_partition_id()).show()` ? – philantrovert Jan 08 '18 at 07:23
  • It's alright. It gives accurate partitioning for Spark 1.6 but gives the same partition id for all records in Spark 2.2. – philantrovert Jan 08 '18 at 12:04

2 Answers


You have used the repartition and mapPartitionsWithIndex functions correctly.

If you apply the explain function:

df.repartition(2, "category_id").explain()

you will see the following output, which clearly shows that the data is repartitioned into two partitions.

== Physical Plan ==
Exchange hashpartitioning(category_id#0L, 2)
+- Scan ExistingRDD[category_id#0L,articletitle#1L]

Now the real culprit is the hash partitioning: when the number of partitions is 2, keys such as 1, 10, 1000, 100000 ... end up in the same hash bucket.

The solution would be to increase the number of partitions to 3 or more,

or

change the category_id 1000 to something else.
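To see why the partition count matters, here is a toy sketch of the assignment rule (partition = non-negative hash mod numPartitions). The hash values below are stand-ins chosen to illustrate the failure mode, not Spark's actual Murmur3 output:

```python
def assign_partition(h, num_partitions):
    # Spark assigns each row to pmod(hash(key), numPartitions).
    return h % num_partitions

# Stand-in hash values for two keys that collide with 2 partitions
# but separate once a third partition is available.
hashes = {"1": 7, "1000": 3}

two = {k: assign_partition(h, 2) for k, h in hashes.items()}    # both keys -> partition 1
three = {k: assign_partition(h, 3) for k, h in hashes.items()}  # keys now split apart
```

With 2 partitions both keys land in the same bucket and one partition stays empty; with 3 they separate, which is exactly the first fix suggested above.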


The reason why repartition puts all the data into one partition is explained by @Ramesh Maharjan in the answer above.

I was able to send the data to different partitions by using a custom partitioner. I converted the RDD into a pair RDD of the form (category_id, row) and used the partitionBy method, passing in the number of partitions and the custom partitioner.

    # Collect the distinct categories and map each one to its own partition index
    categories = input_df.select("category_id").distinct().rdd.map(lambda r: r.category_id).collect()
    cat_idx = dict([(cat, idx) for idx, cat in enumerate(categories)])

    def category_partitioner(cid):
        return cat_idx[cid]

    # Build the pair RDD and partition it by category
    pair_rdd = input_df.rdd.map(lambda r: (r.category_id, r))
    partitioned_rdd = pair_rdd.partitionBy(len(categories), category_partitioner)
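The key property of this partitioner is that the category_id -> index mapping is a bijection, so two distinct categories can never collide. A quick plain-Python check of that logic (no Spark needed), using the two categories from the question:

```python
# Simulate the collected distinct categories from the question's data.
categories = ["1000", "1"]
cat_idx = dict([(cat, idx) for idx, cat in enumerate(categories)])

def category_partitioner(cid):
    return cat_idx[cid]

# Every distinct category gets its own partition index.
assignments = {cid: category_partitioner(cid) for cid in categories}
```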