
I have an RDD of 36 elements and a cluster of 3 nodes with 4 cores each. I have repartitioned the RDD into 36 partitions so that each partition might have an element to process, but the 36 elements end up distributed such that only 4 partitions have 9 elements each and the rest are empty, so they have nothing to process and the cluster resources are underutilized.
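
A simplified sketch of what this looks like (the Int elements here are just placeholders for my real records, each of which needs heavy processing):

// Placeholder elements standing in for my real records.
val rdd = sc.parallelize(1 to 36).repartition(36)

// Count how many elements land in each of the 36 partitions.
rdd.mapPartitions(iter => Iterator(iter.size)).collect()
// In my case only 4 of the 36 partitions end up with data, 9 elements each.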

How can I repartition the data to ensure that every partition has some data to process? How can I ensure that every partition has exactly 3 elements to process?

Ravi Ranjan
  • Are you using `coalesce` or `repartition`? I guess it might also be because you have very few elements. – philantrovert Aug 21 '17 at 09:16
  • I am using `repartition`. Yes, I have very few elements, in this case only 36, but each element has a lot of processing to do. I want each partition to have some data, rather than uneven partitioning. – Ravi Ranjan Aug 21 '17 at 09:17
  • @philantrovert is there a solution to this question? Because I have millions of records, but some partitions don't receive any data at all and some get as much data as 5 partitions' worth. – Ankush Singh Aug 21 '17 at 09:20
  • @AnkushSingh `repartition` should do it, because it shuffles data across all partitions, and the resulting partitions should then have nearly equal amounts of data. – philantrovert Aug 21 '17 at 09:21
  • First you say that you have 36 records, then millions of records... So which case is it? – eliasah Aug 21 '17 at 09:33
  • @eliasah a different person raised the question about millions of records, though we both have the same problem. – Ravi Ranjan Aug 21 '17 at 09:36

1 Answer


By definition, `repartition(numPartitions)` reshuffles the data in the RDD randomly to create either more or fewer partitions and balance it across them, which always shuffles all data over the network.

The guarantee that Apache Spark gives is that the data is distributed evenly, but this won't yield exactly the same number of elements per partition. (Also, this dataset is very small!)
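
For instance, a quick way to check what `repartition` alone gives you (a sketch; the exact counts will vary between runs):

// Repartition 36 elements into 12 partitions and count each partition.
// repartition shuffles everything and spreads the data roughly evenly,
// but nothing guarantees exactly 3 elements per partition.
val repartitioned = sc.parallelize(1 to 36).repartition(12)
repartitioned.mapPartitions(iter => Iterator(iter.size)).collect()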

You might consider using a `HashPartitioner` instead:

scala> val rdd = sc.parallelize(for { x <- 1 to 36 } yield (x, None), 8) 
rdd: org.apache.spark.rdd.RDD[(Int, None.type)] = ParallelCollectionRDD[31] at parallelize at <console>:27

scala> import org.apache.spark.rdd.RDD
import org.apache.spark.rdd.RDD

scala> import org.apache.spark.HashPartitioner
import org.apache.spark.HashPartitioner

scala> def countByPartition(rdd: RDD[(Int, None.type)]) = rdd.mapPartitions(iter => Iterator(iter.length))
countByPartition: (rdd: org.apache.spark.rdd.RDD[(Int, None.type)])org.apache.spark.rdd.RDD[Int]

scala> countByPartition(rdd).collect
res25: Array[Int] = Array(4, 5, 4, 5, 4, 5, 4, 5)

scala> countByPartition(rdd.partitionBy(new HashPartitioner(12))).collect
res26: Array[Int] = Array(3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3)

I have borrowed the example and helper function from zero323's answer to How does HashPartitioner work?

I hope this helps!

EDIT:

If you had done the following:

scala> val rdd = sc.parallelize(for { x <- 1 to 36 } yield (x, None), 12) 
rdd: org.apache.spark.rdd.RDD[(Int, None.type)] = ParallelCollectionRDD[36] at parallelize at <console>:29

scala> countByPartition(rdd).collect
res28: Array[Int] = Array(4, 5, 4, 5, 4, 5, 4, 5)

The results wouldn't necessarily be the same.
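
Also note that `partitionBy` and `HashPartitioner` only apply to pair RDDs, so if your elements are not already key/value pairs you would need to key them first. A rough sketch (the element values and the `process` function are just placeholders):

import org.apache.spark.HashPartitioner

// HashPartitioner only works on pair RDDs, so give each element an index key.
// zipWithIndex assigns indices 0..35; hashing them into 12 partitions
// puts exactly 3 elements in each partition.
val keyed = sc.parallelize(1 to 36)            // placeholder elements
  .zipWithIndex()
  .map { case (elem, idx) => (idx, elem) }     // key by index
  .partitionBy(new HashPartitioner(12))

// Placeholder for the real per-element work.
def process(x: Int): Int = x * 2

val result = keyed.mapValues(process)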

eliasah