Take this sample code as an example.
Here I generate a dataframe where 50 unique entries are repeated 8 times.
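import random
from pprint import pprint

# Assumed starting point (not shown above): a one-row dataframe, so that the 49 unions below give 50 unique rows.
dfu = spark.createDataFrame([(random.random(), random.random())], ['uno', 'due'])
for i in range(49):
    uno = random.random()
    due = random.random()
    l = [(uno, due)]
    dfu = dfu.union(spark.createDataFrame(l, ['uno', 'due']))
# Each pass unions the dataframe with itself, doubling it: 50 * 2**3 = 400 rows.
for i in range(3):
    dfu = dfu.union(dfu)
print('Elements:', dfu.count())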
Elements: 400
Then I inspect my dataframe's RDD. Having 2400 partitions makes no sense in the first place, since 2000 of them are empty, but that is not the issue I'm addressing here.
# ORIGINAL
print("Original partitions:", dfu.rdd.getNumPartitions())
pprint(dfu.rdd.glom().collect())
Original partitions: 2400
I repartition by the column "uno", asking for 400 partitions. What I expect is 400 partitions in which every non-empty partition holds a single value of "uno". Since I only have 50 unique values for "uno", I expect 50 non-empty partitions and 350 empty ones.
# REPARTITIONS
df1 = dfu.repartition(400, "uno").sortWithinPartitions("due")
print("Repartitions:", df1.rdd.getNumPartitions())
pprint(df1.rdd.glom().collect())
Repartitions: 400
I do get 400 partitions, but some of them contain more than one unique value of "uno"; in other words, I get fewer than 50 non-empty partitions.
This is frustrating, first because the behavior is unexpected and not described in the API documentation, but mostly because I may want to run df1.mapPartitions() and write code that assumes each partition contains a single unique value of "uno". Note that I am already requesting more partitions (400) than I actually need (50).
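To make the concern concrete, here is a minimal sketch of per-partition code that does not rely on one value of "uno" per partition and instead groups the rows by key inside the partition. It is plain Python so it runs without Spark; `process_partition` is a hypothetical name, and rows are modelled as `(uno, due)` tuples rather than real Spark rows.

```python
from collections import defaultdict
from typing import Dict, Iterable, Iterator, List, Tuple

# Stand-in for a Spark Row with the two fields (uno, due); an assumption for illustration.
Row = Tuple[float, float]


def process_partition(rows: Iterable[Row]) -> Iterator[Tuple[float, List[float]]]:
    """Group one partition's rows by "uno" instead of assuming a single key.

    Yields one (uno, sorted list of due values) pair per distinct "uno"
    found in the partition, so a partition holding several keys is handled.
    """
    groups: Dict[float, List[float]] = defaultdict(list)
    for uno, due in rows:
        groups[uno].append(due)
    for uno, dues in groups.items():
        yield uno, sorted(dues)


# A partition that holds two different keys, as in the output I observed.
partition = [(0.065, 0.89), (0.94, 0.90), (0.065, 0.10)]
for key, values in process_partition(partition):
    print(key, values)
```

With Spark, something like df1.rdd.mapPartitions(process_partition) should work the same way, assuming each row unpacks into exactly these two fields, but I would rather not have to write every per-partition function defensively like this.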
Why is Spark behaving this way? Am I missing something?
For the sake of clarity, here is a snapshot of a chunk of my output; different values for "uno" should belong to different partitions.
[],
[ Row(uno=0.06541487834242865, due=0.8924866228784675),
Row(uno=0.06541487834242865, due=0.8924866228784675),
Row(uno=0.06541487834242865, due=0.8924866228784675),
Row(uno=0.06541487834242865, due=0.8924866228784675),
Row(uno=0.06541487834242865, due=0.8924866228784675),
Row(uno=0.06541487834242865, due=0.8924866228784675),
Row(uno=0.06541487834242865, due=0.8924866228784675),
Row(uno=0.06541487834242865, due=0.8924866228784675),
Row(uno=0.9409267037450175, due=0.901923815270492),
Row(uno=0.9409267037450175, due=0.901923815270492),
Row(uno=0.9409267037450175, due=0.901923815270492),
Row(uno=0.9409267037450175, due=0.901923815270492),
Row(uno=0.9409267037450175, due=0.901923815270492),
Row(uno=0.9409267037450175, due=0.901923815270492),
Row(uno=0.9409267037450175, due=0.901923815270492),
Row(uno=0.9409267037450175, due=0.901923815270492)],
[],
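One possible explanation, sketched under an assumption: if repartition(400, "uno") assigns each row to partition hash(uno) mod 400, as a hash partitioner would, then two different values can land in the same partition by chance. The plain-Python estimate below treats the hash as uniformly random, which is a simplification; the function names are mine and nothing here calls Spark.

```python
import random


def prob_all_distinct(num_keys: int, num_partitions: int) -> float:
    """Probability that uniformly hashed keys all land in different partitions."""
    p = 1.0
    for i in range(num_keys):
        # The i-th key must avoid the i partitions already taken.
        p *= (num_partitions - i) / num_partitions
    return p


def expected_non_empty(num_keys: int, num_partitions: int) -> float:
    """Expected number of partitions that receive at least one key."""
    prob_partition_empty = (1 - 1 / num_partitions) ** num_keys
    return num_partitions * (1 - prob_partition_empty)


def simulate_non_empty(num_keys: int, num_partitions: int, seed: int = 0) -> int:
    """One random trial: count the distinct partitions hit by num_keys keys."""
    rng = random.Random(seed)
    return len({rng.randrange(num_partitions) for _ in range(num_keys)})


print("P(no two keys share a partition):", prob_all_distinct(50, 400))
print("Expected non-empty partitions:", expected_non_empty(50, 400))
print("One simulated trial:", simulate_non_empty(50, 400))
```

Under this model the chance that 50 keys fall into 50 different partitions out of 400 is only a few percent, and the expected number of non-empty partitions is roughly 47, which would match seeing fewer than 50 non-empty partitions even though 400 is much larger than 50.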