
In the simple case of a Spark dataset whose partitions are such that each key is present in only a single partition, as in the following two partitions:

  1. [ ("a", 1), ("a", 2) ]
  2. [ ("b", 1) ],

will a shuffle operation (like groupByKey) generally move data across partitions, even though there is no need to?

I am asking because shuffling is expensive, so this matters for large datasets. My use case is exactly this: a large dataset where each key almost always sits in a single partition.
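
(For reference, here is a minimal sketch of how the partition layout can be inspected; the actual split produced by parallelize may differ from the idealized layout above.)

// Minimal sketch: build the example pairs and look at what each partition holds.
// `sc` is the Spark shell's SparkContext.
val pairs = sc.parallelize(Seq(("a", 1), ("a", 2), ("b", 1)), 2)
pairs.glom().collect().zipWithIndex.foreach { case (part, idx) =>
  println(s"partition $idx: ${part.mkString(", ")}")
}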

Eric O Lebigot

1 Answer


Well, it depends. By default groupByKey uses a HashPartitioner. Let's assume you have only two partitions. That means pairs with key "a" will go to partition number 1

scala> "a".hashCode % 2
res17: Int = 1

and pairs with key "b" to partition number 0

scala> "b".hashCode % 2
res18: Int = 0
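
The same mapping can be checked directly through the Partitioner API (a small sketch, consistent with the hashCode arithmetic above):

import org.apache.spark.HashPartitioner

// HashPartitioner places a key at nonNegativeMod(key.hashCode, numPartitions)
val hp = new HashPartitioner(2)
hp.getPartition("a")  // 1
hp.getPartition("b")  // 0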

If you create an RDD like this:

val rdd = sc.parallelize(("a", 1) :: ("a", 2) :: ("b", 1) :: Nil, 2).cache

there are multiple possible scenarios for how the data is distributed. First we'll need a small helper:

import org.apache.spark.TaskContext

// Pairs each partition's ID with a list of the records it holds
def addPartId[T](iter: Iterator[T]) = {
  Iterator((TaskContext.get.partitionId, iter.toList))
}

Scenario 1

rdd.mapPartitions(addPartId).collect
Array((0,List((b,1))), (1,List((a,1), (a,2))))

No data movement is required, since all pairs are already on the right partitions.

Scenario 2

Array((0,List((a,1), (a,2))), (1,List((b,1))))

Although the matching pairs are already on the same partitions, all pairs have to be moved, since the partition IDs don't match the keys' hash-based targets.

Scenario 3

A mixed distribution where only part of the data has to be moved:

Array((0,List((a,1))), (1,List((a,2), (b,1))))

If the data is partitioned using a HashPartitioner before groupByKey, there is no need for shuffling whatsoever.

val rddPart = rdd.partitionBy(new HashPartitioner(2)).cache
rddPart.mapPartitions(addPartId).collect

Array((0,List((b,1))), (1,List((a,1), (a,2))))

rddPart.groupByKey
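
A quick empirical check (also suggested in the comments below) that no new shuffle is introduced here: the result of groupByKey should reuse the existing partitioner, and its lineage should not gain an extra ShuffledRDD.

println(rddPart.partitioner)               // Some(org.apache.spark.HashPartitioner@...)
println(rddPart.groupByKey.partitioner)    // expected: the same HashPartitioner
println(rddPart.groupByKey.toDebugString)  // no additional shuffle stage for groupByKey
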
zero323
  • Do you have a reference that shows that "by default groupByKey is using a HashPartitioner"? I imagined that this was the case, but was unable to google this (not knowing the class name HashPartitioner). Now, after googling with this concept, it looks to me like it is actually possible to avoid the shuffle by specifying a custom partitioner for the RDD (https://www.safaribooksonline.com/library/view/learning-spark/9781449359034/ch04.html), no? – Eric O Lebigot Aug 22 '15 at 00:22
  • You can for example check [`pyspark.rdd.groupBy`](https://github.com/apache/spark/blob/ffa05c84fe75663fc33f3d954d1cb1e084ab3280/python/pyspark/rdd.py#L1837) or its equivalents in the Scala API. Empirically, you can compare `rdd.partitioner` and `rdd.groupByKey.partitioner`. Regarding the custom partitioner, I don't understand why you think it can help you here. Theoretically it could be possible to compute key-distribution statistics first and then try to optimize the key-partition mapping, but there is no guarantee you'll get a better result, and it is not a trivial problem. – zero323 Aug 22 '15 at 09:10
  • Great reference! A custom partitioner helps in my case, as far as I understand: many files are read, with each line generating a key; the special property of these files is that a given key is *almost* always in a single file. However, a shuffle is needed in order to bring back those rare keys that span multiple files. I am thinking of calculating a partitioner that maps each key to the partition where it is found most often (in my specific case, there should probably be a single such partition; a sketch of this idea appears after this thread). You are right, though: this might not make the whole operation faster. – Eric O Lebigot Aug 22 '15 at 13:07
  • It is an interesting approach, but it can be fragile. It is likely you'll get a significantly unbalanced distribution. For example, on input like this (`|` delimits partitions): `|AAAABB|BBBCCC|CCDDDD|` -> `|AAAA|BBBBBCCCCC|DDDD|`. Another problem I see is that it "pollutes" the whole downstream pipeline and doesn't work so well when you have to increase the number of partitions ([after a join](http://stackoverflow.com/a/31662127/1560062) for example). – zero323 Aug 22 '15 at 17:12
  • I have some idea of how to handle a case like yours using the standard hash partitioner in combination with preprocessed input, externally computed hashes, and `wholeTextFiles`, but I'll need some time to test it. – zero323 Aug 22 '15 at 17:31
  • Fortunately, my real use case is (probably) even simpler than the case that you are mentioning: the partitions contain things like |AAAABBCC…Z|Z1122…89|999…|, i.e. most keys are really in a single partition, with shared keys at the "seams" between partitions (each partition has millions of keys). Also, it would be inconvenient to preprocess the input (because it is changing all the time, and because the original data has a nice (time) structure). – Eric O Lebigot Aug 23 '15 at 01:30
  • Good explanation. But "Scenario 2" apparently could be optimized so that the shuffle is unnecessary, right? Not sure if this is already fixed in Spark 2.4.3? – Leon May 10 '19 at 08:53
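
A hedged sketch of the custom-partitioner idea discussed in the comments above. It assumes a precomputed key-to-partition map (here called `preferred`, a hypothetical name) built from key-distribution statistics, and falls back to plain hash placement for keys that are not in the map:

import org.apache.spark.Partitioner

// Illustrative sketch only: route each key to the partition where it was
// observed most often, falling back to hashCode-based placement otherwise.
class PreferredPartitionPartitioner(numParts: Int, preferred: Map[String, Int])
    extends Partitioner {

  override def numPartitions: Int = numParts

  override def getPartition(key: Any): Int = key match {
    case null      => 0
    case k: String => preferred.getOrElse(k, nonNegativeMod(k.hashCode))
    case other     => nonNegativeMod(other.hashCode)
  }

  // Same idea as Spark's nonNegativeMod: keep the result in [0, numParts)
  private def nonNegativeMod(hash: Int): Int = {
    val mod = hash % numParts
    if (mod < 0) mod + numParts else mod
  }
}

// Usage sketch (preferredMap is a hypothetical Map[String, Int] computed upfront):
// val rddPref = rdd.partitionBy(new PreferredPartitionPartitioner(2, preferredMap))
// rddPref.groupByKey   // reuses the custom partitioner, so no second shuffle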