Apache Spark Partitioning in map()

Question

Can anyone explain this passage to me?

The flipside, however, is that for transformations that cannot be guaranteed to produce a known partitioning, the output RDD will not have a partitioner set. For example, if you call map() on a hash-partitioned RDD of key/value pairs, the function passed to map() can in theory change the key of each element, so the result will not have a partitioner. Spark does not analyze your functions to check whether they retain the key. Instead, it provides two other operations, mapValues() and flatMapValues(), which guarantee that each tuple’s key remains the same.

Source: Learning Spark by Matei Zaharia, Patrick Wendell, Andy Konwinski, and Holden Karau.

score 3 · Accepted Answer · answered Apr 27 '18 at 09:46

It is pretty simple:

A Partitioner is a function from a key to a partition (see "How does HashPartitioner work?").
A Partitioner can be applied to an RDD[(K, V)], where K is the key type.
Once you repartition using a specific Partitioner, all pairs with the same key are guaranteed to reside in the same partition.
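The three points above can be modeled without Spark at all. The following is a minimal plain-Python sketch, not Spark's implementation: `hash_partition` and `partition_by` are hypothetical helper names, and `zlib.crc32` is used only to get a hash that is stable across runs (Spark's HashPartitioner uses the key's own hash code instead).

```python
from collections import defaultdict
import zlib


def hash_partition(key, num_partitions):
    """Toy partitioner: a function from a key to a partition index."""
    # crc32 is an assumption made for reproducibility; it is not Spark's hash.
    return zlib.crc32(repr(key).encode("utf-8")) % num_partitions


def partition_by(pairs, num_partitions):
    """Distribute (key, value) pairs into a list of partitions by key."""
    partitions = [[] for _ in range(num_partitions)]
    for key, value in pairs:
        partitions[hash_partition(key, num_partitions)].append((key, value))
    return partitions


pairs = [("a", 1), ("b", 2), ("a", 3), ("c", 4), ("b", 5)]
partitions = partition_by(pairs, 3)

# Record, for every key, the set of partitions it ended up in.
key_locations = defaultdict(set)
for index, partition in enumerate(partitions):
    for key, _ in partition:
        key_locations[key].add(index)
```

Because the partition index depends only on the key, every key in `key_locations` maps to exactly one partition, which is the guarantee described in the last point.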

Now, let's consider two examples:

map takes a function (K, V) => U and returns RDD[U]; in other words, it transforms the whole Tuple2. It may or may not preserve the key, and it might not even return an RDD[(_, _)], so the partitioner is not preserved.
mapValues takes a function V => U and returns RDD[(K, U)]; in other words, it transforms only the values. The key, which determines partition membership, is never touched, so the partitioning is preserved.
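The difference between the two operations can be illustrated with a toy model in plain Python. This is a sketch under stated assumptions, not Spark code: `ToyRDD`, `partition_by`, `stable_hash`, and `is_consistent` are hypothetical names invented for the illustration, and the "RDD" is just a list of partitions plus an optional partitioner function.

```python
from dataclasses import dataclass
from typing import Callable, List, Optional
import zlib


def stable_hash(key) -> int:
    # crc32 keeps the example reproducible; it is not the hash Spark uses.
    return zlib.crc32(repr(key).encode("utf-8"))


@dataclass
class ToyRDD:
    """Hypothetical stand-in for an RDD: partitions plus an optional partitioner."""
    partitions: List[list]
    partitioner: Optional[Callable[[object], int]] = None

    def map(self, f):
        # f sees the whole element and may change or drop the key,
        # so the result cannot claim any partitioner.
        return ToyRDD([[f(x) for x in p] for p in self.partitions], None)

    def map_values(self, f):
        # f only ever sees the value; keys stay where they are,
        # so the existing partitioner still describes the data.
        new_parts = [[(k, f(v)) for k, v in p] for p in self.partitions]
        return ToyRDD(new_parts, self.partitioner)


def partition_by(pairs, num_partitions):
    """Build a ToyRDD whose pairs are placed according to a hash of the key."""
    def partitioner(key):
        return stable_hash(key) % num_partitions

    parts = [[] for _ in range(num_partitions)]
    for k, v in pairs:
        parts[partitioner(k)].append((k, v))
    return ToyRDD(parts, partitioner)


def is_consistent(rdd, partitioner):
    """True if every pair sits in the partition its key maps to."""
    return all(partitioner(k) == i
               for i, p in enumerate(rdd.partitions) for k, _ in p)


base = partition_by([("a", 1), ("b", 2), ("a", 3), ("c", 4)], 3)
doubled = base.map_values(lambda v: v * 2)           # keys untouched
rekeyed = base.map(lambda kv: (kv[0] + "!", kv[1]))  # keys changed
```

Here `doubled.partitioner` is the same function as `base.partitioner`, while `rekeyed.partitioner` is `None`, even though a different function passed to `map` might have left the keys alone. In Spark itself, the corresponding thing to inspect is the RDD's `partitioner` property.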
