Well, it depends. By default, groupByKey uses a HashPartitioner. Let's assume you have only two partitions. Pairs with key "a" will then go to partition 1:
scala> "a".hashCode % 2
res17: Int = 1
and pairs with key "b" to partition 0:
scala> "b".hashCode % 2
res18: Int = 0
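The assignment rule behind these two results can be sketched in plain Scala. This is only a minimal model of what HashPartitioner does (a non-negative modulus of the key's hashCode, with null keys sent to partition 0); partitionFor is a hypothetical helper name, not part of Spark's API.

```scala
// Minimal model of hash partitioning; `partitionFor` is a hypothetical name.
def partitionFor(key: Any, numPartitions: Int): Int = {
  if (key == null) 0 // null keys go to partition 0
  else {
    val raw = key.hashCode % numPartitions
    // % can be negative for negative hash codes, so shift into [0, numPartitions)
    if (raw < 0) raw + numPartitions else raw
  }
}

val partitionOfA = partitionFor("a", 2) // "a".hashCode is 97, so this is 1
val partitionOfB = partitionFor("b", 2) // "b".hashCode is 98, so this is 0
```

The extra step for negative values matters because a plain % would otherwise yield an invalid partition index for keys with a negative hashCode.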
If you create an RDD like this:
val rdd = sc.parallelize(("a", 1) :: ("a", 2) :: ("b", 1) :: Nil, 2).cache
there are multiple scenarios for how the data can be distributed. First we'll need a small helper:
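import org.apache.spark.TaskContext
// Tag the contents of each partition with the ID of the partition holding them.
def addPartId[T](iter: Iterator[T]) = {
  Iterator((TaskContext.get.partitionId, iter.toList))
}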
Scenario 1
rdd.mapPartitions(addPartId).collect
Array((0,List((b,1))), (1,List((a,1), (a,2))))
No data movement is required since all pairs are already on the right partitions.
Scenario 2
Array((0,List((a,1), (a,2))), (1,List((b,1))))
Although matching pairs are already on the same partition, all pairs have to be moved since the partition IDs don't match the keys.
Scenario 3
Some mixed distribution where only part of the data has to be moved:
Array((0,List((a,1))), (1,List((a,2), (b,1))))
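To make the three scenarios concrete, here is a rough plain-Scala sketch that counts, for each layout, how many pairs sit on a partition other than the one a two-partition hash rule would pick. It is a conceptual model only, not a measurement of what Spark transfers at runtime; partitionFor and misplaced are hypothetical helper names.

```scala
// Hypothetical helper: non-negative modulus of the key's hashCode.
def partitionFor(key: Any, numPartitions: Int): Int = {
  val raw = key.hashCode % numPartitions
  if (raw < 0) raw + numPartitions else raw
}

// Count the pairs whose current partition differs from their hash target.
def misplaced(layout: Seq[(Int, List[(String, Int)])], numPartitions: Int): Int =
  layout.map { case (partId, pairs) =>
    pairs.count { case (key, _) => partitionFor(key, numPartitions) != partId }
  }.sum

// The three layouts shown above, as (partition ID, pairs) tuples.
val scenario1 = Seq(0 -> List("b" -> 1), 1 -> List("a" -> 1, "a" -> 2))
val scenario2 = Seq(0 -> List("a" -> 1, "a" -> 2), 1 -> List("b" -> 1))
val scenario3 = Seq(0 -> List("a" -> 1), 1 -> List("a" -> 2, "b" -> 1))

val moved1 = misplaced(scenario1, 2) // 0: every pair is already in place
val moved2 = misplaced(scenario2, 2) // 3: every pair is on the wrong partition
val moved3 = misplaced(scenario3, 2) // 2: ("a", 1) and ("b", 1) are misplaced
```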
If the data is partitioned using a HashPartitioner before groupByKey, there is no need for shuffling whatsoever.
import org.apache.spark.HashPartitioner
val rddPart = rdd.partitionBy(new HashPartitioner(2)).cache
rddPart.mapPartitions(addPartId).collect
Array((0,List((b,1))), (1,List((a,1), (a,2))))
rddPart.groupByKey
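One way to check this is to look at the dependencies of the grouped RDD: a ShuffleDependency means a shuffle stage is planned, while a narrow dependency means the grouping happens within the existing partitions. The sketch below assumes Spark is on the classpath; the local[2] master, the application name and the needsShuffle helper are illustrative choices, not something taken from the answer above.

```scala
import org.apache.spark.{HashPartitioner, ShuffleDependency, SparkConf, SparkContext}
import org.apache.spark.rdd.RDD

// Reuses the running context (e.g. inside spark-shell) or starts a local one.
val conf = new SparkConf().setMaster("local[2]").setAppName("shuffle-check")
val sc = SparkContext.getOrCreate(conf)

val rdd = sc.parallelize(("a", 1) :: ("a", 2) :: ("b", 1) :: Nil, 2)
val rddPart = rdd.partitionBy(new HashPartitioner(2))

// Hypothetical helper: true if the RDD's direct parent link is a shuffle.
def needsShuffle(grouped: RDD[_]): Boolean =
  grouped.dependencies.exists(_.isInstanceOf[ShuffleDependency[_, _, _]])

// No partitioner is known for `rdd`, so groupByKey has to plan a shuffle.
val plainNeedsShuffle = needsShuffle(rdd.groupByKey())

// `rddPart` already carries a matching HashPartitioner, so the grouping
// can be done partition by partition.
val partitionedNeedsShuffle = needsShuffle(rddPart.groupByKey())
val knownPartitioner = rddPart.partitioner // Some(HashPartitioner) with 2 partitions
```

Calling toDebugString on the grouped RDDs shows the same difference in a more readable form.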