
To understand how Spark partitioning works, I ran the following piece of code on Spark 1.6.

import org.apache.spark.rdd.RDD

// Count the size of each partition of an RDD[(String, Int)]
def countByPartition1(rdd: RDD[(String, Int)]) = {
    rdd.mapPartitions(iter => Iterator(iter.length))
}

// Count the size of each partition of an RDD[String]
def countByPartition2(rdd: RDD[String]) = {
    rdd.mapPartitions(iter => Iterator(iter.length))
}

// Case 1
val rdd1 = sc.parallelize(Array(("aa", 1), ("aa", 1), ("aa", 1), ("aa", 1)), 8)
countByPartition1(rdd1).collect()
>> Array[Int] = Array(0, 1, 0, 1, 0, 1, 0, 1)

// Case 2
val rdd2 = sc.parallelize(Array("aa", "aa", "aa", "aa"), 8)
countByPartition2(rdd2).collect()
>> Array[Int] = Array(0, 1, 0, 1, 0, 1, 0, 1)

In both cases, the data is distributed uniformly. I have the following questions based on the above observation:

  1. In the case of rdd1, hash partitioning should calculate the hash code of the key (i.e. "aa" in this case), so shouldn't all records go to a single partition instead of being distributed uniformly? (I have sketched what I expected just below these questions.)
  2. In the case of rdd2, there is no key-value pair, so how would hash partitioning work here, i.e. what is the key used to calculate the hash code?
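
To make my expectation in question 1 concrete, here is a rough sketch of what I assumed would happen if a HashPartitioner were actually applied (expectedPartition is just an illustrative name):

import org.apache.spark.HashPartitioner

// My assumption: the partition index comes from the key's hash code,
// so every ("aa", 1) record should land in the same partition.
val expectedPartition = new HashPartitioner(8).getPartition("aa")
>> expectedPartition: Int = 0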

I have gone through @zero323's answer too but couldn't find answers to the above questions.

Vikash Pareek

1 Answer


On such an initial assignment, whether from reading a file or from generating the data yourself in the driver, no actual partitioner such as a HashPartitioner is applied.

If you run val p = rdd1.partitioner, you will see the value None.
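
You can see the contrast by explicitly applying a HashPartitioner - a quick sketch (the name hashed is just for illustration, and your REPL output may differ slightly):

import org.apache.spark.HashPartitioner

rdd1.partitioner
>> Option[org.apache.spark.Partitioner] = None

// Explicitly repartitioning by key hash sets a partitioner, and every
// ("aa", 1) record hashes to the same partition.
val hashed = rdd1.partitionBy(new HashPartitioner(8))
hashed.partitioner
>> Option[org.apache.spark.Partitioner] = Some(org.apache.spark.HashPartitioner@8)

countByPartition1(hashed).collect()
>> Array[Int] = Array(4, 0, 0, 0, 0, 0, 0, 0)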

  • Given N values for an RDD, whether in V or (K, V) format (the format is not actually relevant here), and M partitions, Spark must have an algorithm for deciding where to place which data, else it could never proceed past this step and we could not get on with things!

  • For such a driver-created RDD, Spark simply places the values at equal intervals of M / N partitions, rounding each position up to the next integer.

  • So if I have 4 values with 10 partitions, the interval is 10 / 4 = 2.5, and data is placed at partitions 3, 5, 8 and 10 (the next higher integers of 2.5, 5, 7.5 and 10). The same applies to 4 values with 8 partitions: the interval is 2, so data lands in partitions 2, 4, 6 and 8 - which is exactly what you are seeing, and what the sketch below reproduces.
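
Here is a small sketch of that placement rule (a simplification of how Spark slices a parallelized collection, not the exact internal code; sliceSizes is just an illustrative name):

// Partition i of M receives the values with indices in [i*N/M, (i+1)*N/M),
// using integer division - this spreads N values evenly over M partitions.
def sliceSizes(numValues: Int, numPartitions: Int): Seq[Int] =
  (0 until numPartitions).map { i =>
    val start = (i.toLong * numValues / numPartitions).toInt
    val end   = ((i + 1).toLong * numValues / numPartitions).toInt
    end - start
  }

sliceSizes(4, 8)
>> Vector(0, 1, 0, 1, 0, 1, 0, 1)    // matches the question's output

sliceSizes(4, 10)
>> Vector(0, 0, 1, 0, 1, 0, 0, 1, 0, 1)    // data in partitions 3, 5, 8, 10 (1-based)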

No hashing is applied yet; the initial allocation is a special case. This is how it works for driver-created RDDs; for file-based sources it is a little different.

thebluephantom