To understand how Spark partitioning works, I am running the following piece of code on Spark 1.6:
import org.apache.spark.rdd.RDD

// Count the number of records in each partition of an RDD[(String, Int)]
def countByPartition1(rdd: RDD[(String, Int)]) = {
  rdd.mapPartitions(iter => Iterator(iter.length))
}

// Count the number of records in each partition of an RDD[String]
def countByPartition2(rdd: RDD[String]) = {
  rdd.mapPartitions(iter => Iterator(iter.length))
}
// Case 1
val rdd1 = sc.parallelize(Array(("aa", 1), ("aa", 1), ("aa", 1), ("aa", 1)), 8)
countByPartition1(rdd1).collect()
>> Array[Int] = Array(0, 1, 0, 1, 0, 1, 0, 1)
// Case 2
val rdd2 = sc.parallelize(Array("aa", "aa", "aa", "aa"), 8)
countByPartition2(rdd2).collect()
>> Array[Int] = Array(0, 1, 0, 1, 0, 1, 0, 1)
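To double-check which record lands in which partition, I also wrote a small variant of the helper above (a sketch along the same lines; mapPartitionsWithIndex tags each record with the index of the partition it lives in):

// Tag every record with the index of its partition.
def recordsByPartition[T](rdd: RDD[T]) = {
  rdd.mapPartitionsWithIndex { (idx, iter) =>
    iter.map(record => (idx, record))
  }
}

// Given the counts above, this should show the records sitting in
// partitions 1, 3, 5 and 7, e.g.
// Array((1,(aa,1)), (3,(aa,1)), (5,(aa,1)), (7,(aa,1))) for rdd1.
recordsByPartition(rdd1).collect()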
In both cases, the data is distributed uniformly. Based on this observation, I have the following questions:
- In the case of rdd1, hash partitioning should compute the hash code of the key (i.e. "aa" in this case), so shouldn't all records go to a single partition instead of being distributed uniformly? (See the sketch after these questions for the comparison I have in mind.)
- In the case of rdd2, there are no key-value pairs, so how would hash partitioning work at all, i.e. what would be the key used to compute the hash code?
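For comparison, this sketch shows the behaviour I expected, by partitioning rdd1 explicitly with a HashPartitioner (assuming the same SparkContext and the countByPartition1 helper from above):

import org.apache.spark.HashPartitioner

// Explicitly hash-partition rdd1 by key and recount records per partition.
val hashed = rdd1.partitionBy(new HashPartitioner(8))

// Every key is "aa", so key.hashCode % 8 should send all four records to
// the same partition, i.e. something like Array(4, 0, 0, 0, 0, 0, 0, 0).
countByPartition1(hashed).collect()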
I have also read through @zero323's answer, but it did not resolve the questions above.