In every example I have seen, partitionBy receives a new instance of a HashPartitioner:
val rddTenP = rdd.partitionBy(new HashPartitioner(10))
I am joining two RDDs. Both are keyed by userId, and the keys in each RDD come from the same set of values. Should I partition both of them so that the join is more efficient? If so, should I create a single HashPartitioner instance hp,
val hp: HashPartitioner = new spark.HashPartitioner(84)
and pass hp to both partitionBy calls, so that the rows to be joined land on the same node? Is that how partitionBy works?
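
To make the question concrete, here is a minimal sketch of what I have in mind (the RDD names, the sample data, and the partition count of 84 are just placeholders):

import org.apache.spark.{HashPartitioner, SparkConf, SparkContext}

val sc = new SparkContext(new SparkConf().setAppName("partitionBy-join").setMaster("local[*]"))

// Two RDDs keyed by userId (placeholder data).
val users   = sc.parallelize(Seq((1, "alice"), (2, "bob"), (3, "carol")))  // (userId, name)
val ratings = sc.parallelize(Seq((1, 4.5), (2, 3.0), (1, 5.0)))            // (userId, rating)

// One shared partitioner instance, passed to both partitionBy calls.
val hp = new HashPartitioner(84)
val usersByKey   = users.partitionBy(hp)
val ratingsByKey = ratings.partitionBy(hp)

// The hope: because both RDDs use the same partitioner, rows with the same
// userId sit in corresponding partitions, so the join should not need a full shuffle.
val joined = usersByKey.join(ratingsByKey)  // RDD[(Int, (String, Double))]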