
To understand how Spark partitioning works, I ran the following piece of code on Spark 1.6.

import org.apache.spark.rdd.RDD

// Count the size of each partition of an RDD[(String, Int)]
def countByPartition1(rdd: RDD[(String, Int)]) = {
    rdd.mapPartitions(iter => Iterator(iter.length))
}

// Count the size of each partition of an RDD[String]
def countByPartition2(rdd: RDD[String]) = {
    rdd.mapPartitions(iter => Iterator(iter.length))
}

// Case 1
val rdd1 = sc.parallelize(Array(("aa", 1), ("aa", 1), ("aa", 1), ("aa", 1)), 8)
countByPartition1(rdd1).collect()
>> Array[Int] = Array(0, 1, 0, 1, 0, 1, 0, 1)

// Case 2
val rdd2 = sc.parallelize(Array("aa", "aa", "aa", "aa"), 8)
countByPartition2(rdd2).collect()
>> Array[Int] = Array(0, 1, 0, 1, 0, 1, 0, 1)

In both cases, the data is distributed uniformly. I have the following questions based on the above observation:

  1. In the case of rdd1, hash partitioning should calculate the hash code of the key (i.e. "aa" in this case), so shouldn't all records go to a single partition instead of being distributed uniformly? (I have sketched what I expected just below these questions.)
  2. In the case of rdd2, there is no key-value pair, so how would hash partitioning work here, i.e. what is the key used to calculate the hash code?
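
To make my expectation in question 1 concrete, here is a rough sketch of what I assumed would happen if a HashPartitioner were actually applied (expectedPartition is just an illustrative name):

import org.apache.spark.HashPartitioner

// My assumption: the partition index comes from the key's hash code,
// so every ("aa", 1) record should land in the same partition.
val expectedPartition = new HashPartitioner(8).getPartition("aa")
>> expectedPartition: Int = 0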

I have gone through @zero323's answer too but couldn't find answers to the above questions.

Vikash Pareek

1 Answer


On such an initial assignment, whether from reading a file or from generating the data yourself in the driver, no actual partitioner such as a HashPartitioner is applied.

If you run val p = rdd1.partitioner, you will see the value None.
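
You can see the contrast by explicitly applying a HashPartitioner - a quick sketch (the name hashed is just for illustration, and your REPL output may differ slightly):

import org.apache.spark.HashPartitioner

rdd1.partitioner
>> Option[org.apache.spark.Partitioner] = None

// Explicitly repartitioning by key hash sets a partitioner, and every
// ("aa", 1) record hashes to the same partition.
val hashed = rdd1.partitionBy(new HashPartitioner(8))
hashed.partitioner
>> Option[org.apache.spark.Partitioner] = Some(org.apache.spark.HashPartitioner@8)

countByPartition1(hashed).collect()
>> Array[Int] = Array(4, 0, 0, 0, 0, 0, 0, 0)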

  • Given N values for an RDD, whether in V or (K, V) format (the format is not actually relevant here), and M partitions, Spark must have an algorithm for deciding where to place which data, else it could never proceed past this step and we could not get on with things!

  • For such a driver-created RDD, Spark simply places the values at equal intervals of M / N partitions, rounding each position up to the next integer.

  • So if I have 4 values with 10 partitions, the interval is 10 / 4 = 2.5, and data is placed at partitions 3, 5, 8 and 10 (the next higher integers of 2.5, 5, 7.5 and 10). The same applies to 4 values with 8 partitions: the interval is 2, so data lands in partitions 2, 4, 6 and 8 - which is exactly what you are seeing, and what the sketch below reproduces.
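
Here is a small sketch of that placement rule (a simplification of how Spark slices a parallelized collection, not the exact internal code; sliceSizes is just an illustrative name):

// Partition i of M receives the values with indices in [i*N/M, (i+1)*N/M),
// using integer division - this spreads N values evenly over M partitions.
def sliceSizes(numValues: Int, numPartitions: Int): Seq[Int] =
  (0 until numPartitions).map { i =>
    val start = (i.toLong * numValues / numPartitions).toInt
    val end   = ((i + 1).toLong * numValues / numPartitions).toInt
    end - start
  }

sliceSizes(4, 8)
>> Vector(0, 1, 0, 1, 0, 1, 0, 1)    // matches the question's output

sliceSizes(4, 10)
>> Vector(0, 0, 1, 0, 1, 0, 0, 1, 0, 1)    // data in partitions 3, 5, 8, 10 (1-based)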

No hashing is applied yet; the initial allocation is a special case. This is how it works for driver-created RDDs; for file-based sources it is a little different.

thebluephantom