
Take this sample code as an example.

Here I generate a dataframe where 50 unique entries are repeated 8 times.

import random
from pprint import pprint
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

# 1 initial row + 49 more = 50 unique (uno, due) pairs
dfu = spark.createDataFrame([(random.random(), random.random())], ['uno', 'due'])
for i in range(49):
    uno = random.random()
    due = random.random()
    dfu = dfu.union(spark.createDataFrame([(uno, due)], ['uno', 'due']))

# double the dataframe three times: 50 * 2^3 = 400 rows (each pair repeated 8 times)
for i in range(3):
    dfu = dfu.union(dfu)

print('Elements:', dfu.count())

Elements: 400

Then I inspect my dataframe's underlying RDD. 2400 partitions make no sense in the first place, since 2000 of them are empty, but that is not the issue I'm addressing here.

# ORIGINAL
print("Original partitions:", dfu.rdd.getNumPartitions())
pprint(dfu.rdd.glom().collect())

Original partitions: 2400

I repartition by column, specifying that I want 400 partitions. What I expect is 400 partitions, each containing only rows with the same value of the field "uno" I'm partitioning by. Since I only have 50 unique values of "uno", I expect 50 non-empty partitions and 350 empty ones.

# REPARTITIONS
df1 = dfu.repartition(400, "uno").sortWithinPartitions("due")
print("Repartitions:", df1.rdd.getNumPartitions())
pprint(df1.rdd.glom().collect())

Repartitions: 400

What I get is indeed 400 partitions, but some of them contain more than one unique value of "uno"; in other words, I get fewer than 50 non-empty partitions.

This is frustrating, first because it is unexpected behavior not described in the API documentation, but mostly because I may want to run df1.mapPartitions() and write code that assumes each partition contains only a single value of "uno". Note that I am requesting more partitions (400) than I actually need (50).

Why is Spark behaving this way? Am I missing something?
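
To make the mismatch concrete, here is a small check (a sketch only; it assumes df1 built as above and uses spark_partition_id and countDistinct from pyspark.sql.functions) that counts how many distinct "uno" values end up in each partition; with one key per partition it would return no rows:

from pyspark.sql.functions import spark_partition_id, countDistinct

# Tag each row with the partition it landed in, then count distinct "uno" per partition
(df1.withColumn("pid", spark_partition_id())
    .groupBy("pid")
    .agg(countDistinct("uno").alias("distinct_uno"))
    .filter("distinct_uno > 1")
    .show())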

For the sake of clarity, here is a snapshot of a chunk of my output; different values for "uno" should belong to different partitions.

[],
  [ Row(uno=0.06541487834242865, due=0.8924866228784675),
    Row(uno=0.06541487834242865, due=0.8924866228784675),
    Row(uno=0.06541487834242865, due=0.8924866228784675),
    Row(uno=0.06541487834242865, due=0.8924866228784675),
    Row(uno=0.06541487834242865, due=0.8924866228784675),
    Row(uno=0.06541487834242865, due=0.8924866228784675),
    Row(uno=0.06541487834242865, due=0.8924866228784675),
    Row(uno=0.06541487834242865, due=0.8924866228784675),
    Row(uno=0.9409267037450175, due=0.901923815270492),
    Row(uno=0.9409267037450175, due=0.901923815270492),
    Row(uno=0.9409267037450175, due=0.901923815270492),
    Row(uno=0.9409267037450175, due=0.901923815270492),
    Row(uno=0.9409267037450175, due=0.901923815270492),
    Row(uno=0.9409267037450175, due=0.901923815270492),
    Row(uno=0.9409267037450175, due=0.901923815270492),
    Row(uno=0.9409267037450175, due=0.901923815270492)],
  [],
  • I don't understand how that answer is relevant. Could you please elaborate? – Gianluca Dec 19 '17 at 16:25
  • This is expected behavior for hash partitioning. Your assumptions are just incorrect and not justified: [_Returns a new DataFrame partitioned by the given partitioning expressions. The resulting DataFrame is hash partitioned_](https://spark.apache.org/docs/latest/api/python/pyspark.sql.html?highlight=repartition#pyspark.sql.DataFrame.repartition) – Alper t. Turker Dec 19 '17 at 16:31
  • Thank you, I missed that little detail... – Gianluca Dec 19 '17 at 17:05
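
The behavior the last comments describe can be sketched outside Spark: with hash partitioning, each distinct value of "uno" is hashed into one of the 400 partitions, so two values can collide in the same partition even though 400 > 50. Spark uses its own Murmur3-based hash for DataFrame hash partitioning, but any hash function shows the same birthday-problem effect (the snippet below uses Python's built-in hash purely as an illustration, not Spark's actual partitioner):

import random
from collections import Counter

# 50 random keys hashed into 400 buckets; buckets holding more than one key
# correspond to partitions that receive more than one distinct "uno" value
keys = [random.random() for _ in range(50)]
buckets = Counter(hash(k) % 400 for k in keys)
print(sum(1 for c in buckets.values() if c > 1), 'buckets hold more than one key')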
