By definition, repartition(numPartitions)
reshuffles the data in the RDD randomly to create either more or fewer partitions and balances it across them; this always shuffles all data over the network.
The guarantee that Apache Spark gives is that the data is distributed evenly, but this won't yield exactly the same number of elements in every partition. (Also, the dataset here is very small!)
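To see that in practice, here is a minimal sketch (assuming a SparkContext bound to sc; the exact counts depend on the shuffle and will vary between runs, so no output is shown):

val demo = sc.parallelize(1 to 36, 8)
// repartition(12) moves every element over the network; the 12 resulting
// counts always sum to 36 but are only roughly even, not guaranteed equal.
demo.repartition(12).mapPartitions(iter => Iterator(iter.length)).collect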
You might consider using HashPartitioner instead:
scala> val rdd = sc.parallelize(for { x <- 1 to 36 } yield (x, None), 8)
rdd: org.apache.spark.rdd.RDD[(Int, None.type)] = ParallelCollectionRDD[31] at parallelize at <console>:27
scala> import org.apache.spark.rdd.RDD
import org.apache.spark.rdd.RDD
scala> import org.apache.spark.HashPartitioner
import org.apache.spark.HashPartitioner
scala> def countByPartition(rdd: RDD[(Int, None.type)]) = rdd.mapPartitions(iter => Iterator(iter.length))
countByPartition: (rdd: org.apache.spark.rdd.RDD[(Int, None.type)])org.apache.spark.rdd.RDD[Int]
scala> countByPartition(rdd).collect
res25: Array[Int] = Array(4, 5, 4, 5, 4, 5, 4, 5)
scala> countByPartition(rdd.partitionBy(new HashPartitioner(12))).collect
res26: Array[Int] = Array(3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3)
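That perfectly even split is no accident: HashPartitioner sends a key to partition nonNegativeMod(key.hashCode, numPartitions), and an Int's hash code is the value itself, so the keys 1 to 36 fall into the 12 residue classes modulo 12, 3 keys apiece. A quick sketch of that arithmetic in plain Scala (no Spark needed):

// Emulate HashPartitioner's placement for the keys above: hashCode mod 12,
// forced non-negative. Every partition index 0..11 ends up with 3 keys.
val perBucket = (1 to 36)
  .groupBy(k => ((k.hashCode % 12) + 12) % 12)
  .map { case (p, ks) => p -> ks.size }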
I borrowed the example and the countByPartition helper from zero323's answer to "How does HashPartitioner work?". I hope this helps!
EDIT:
If you had done the following instead:
scala> val rdd = sc.parallelize(for { x <- 1 to 36 } yield (x, None), 12)
rdd: org.apache.spark.rdd.RDD[(Int, None.type)] = ParallelCollectionRDD[36] at parallelize at <console>:29
scala> countByPartition(rdd).collect
res28: Array[Int] = Array(3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3)
The counts look identical, but the results wouldn't necessarily be the same: parallelize slices the collection by position, whereas HashPartitioner routes each key by its hash, so the same key can end up in a different partition.
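To make that last point concrete, here is a small sketch (reusing rdd, RDD and HashPartitioner from above; keysByPartition is just an illustrative helper, not a Spark API) showing which keys land where:

// List the keys held by each partition index.
def keysByPartition(rdd: RDD[(Int, None.type)]) =
  rdd.mapPartitionsWithIndex((i, iter) => Iterator(i -> iter.map(_._1).toList))

keysByPartition(rdd).collect
// positional slicing: partition 0 holds keys 1, 2, 3

keysByPartition(rdd.partitionBy(new HashPartitioner(12))).collect
// hash placement: partition 0 holds keys 12, 24, 36

The per-partition counts agree here only because 36 splits evenly both by position and by hash; the contents of each partition still differ.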