
In every example I see, partitionBy receives a new instance of a HashPartitioner, for example:

    val rddTenP = rdd.partitionBy(new HashPartitioner(10))

I am joining two RDDs. Both are keyed by userId values drawn from the same set. Should I partition both of them to make the join more efficient? If so, should I create a single HashPartitioner instance hp,

    val hp: HashPartitioner = new spark.HashPartitioner(84)

and pass hp to both partitionBy calls, so that the rows to be joined land on the same node? Is that how partitionBy works?

Ozgur Ozturk

1 Answer


You are on the right track: using the same partitioner on both sides optimizes your joins by avoiding shuffles. You can reuse a single HashPartitioner instance, since it is immutable. But using two HashPartitioner instances with the same number of partitions (roughly, partitionIndex = key.hashCode mod numPartitions) works just as well, because such instances are equal:

 override def equals(other: Any): Boolean = other match {
    case h: HashPartitioner =>
      h.numPartitions == numPartitions
    case _ =>
      false
  }
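To illustrate, here is a minimal sketch of pre-partitioning both sides of a join with equal partitioners; the SparkContext setup, RDD names, sample data, and the partition count of 84 are hypothetical:

    import org.apache.spark.{HashPartitioner, SparkConf, SparkContext}

    object PartitionedJoin {
      def main(args: Array[String]): Unit = {
        val sc = new SparkContext(
          new SparkConf().setAppName("partitioned-join").setMaster("local[*]"))

        // Two pair RDDs keyed by userId values from the same set.
        val profiles = sc.parallelize(Seq((1L, "alice"), (2L, "bob")))
        val clicks   = sc.parallelize(Seq((1L, 10), (2L, 3)))

        // Equal partitioners (same numPartitions) send equal keys to the
        // same partition index on both sides.
        val hp = new HashPartitioner(84)
        val profilesByUser = profiles.partitionBy(hp).persist()
        val clicksByUser   = clicks.partitionBy(new HashPartitioner(84)).persist()

        // Because the inputs are co-partitioned, the join does not need a
        // full shuffle: matching userIds already sit in the same partition.
        val joined = profilesByUser.join(clicksByUser)
        joined.collect().foreach(println)

        sc.stop()
      }
    }

Persisting after partitionBy matters here: without it, the partitioning may be recomputed on each action and the shuffle savings can be lost.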

For details and a thorough explanation of how it works, see https://github.com/apache/spark/blob/master/core/src/main/scala/org/apache/spark/Partitioner.scala and How does HashPartitioner work?

Vitalii Kotliarenko
  • Thanks Vitaliy. In summary, using `new HashPartitioner(10)` in each partitionBy call actually creates equal HashPartitioners, and so produces the same partitions for the same keys... – Ozgur Ozturk May 06 '16 at 15:34