28

I am new to Spark. I have a large dataset of elements (an RDD) and I want to divide it into two exactly equal-sized partitions, maintaining the order of elements. I tried using RangePartitioner like

var data = partitionedFile.partitionBy(new RangePartitioner(2, partitionedFile))

This doesn't give a satisfactory result because it divides only roughly into equal-sized partitions while maintaining the order of elements. For example, if there are 64 elements and we use RangePartitioner, it divides them into 31 elements and 33 elements.

I need a partitioner such that I get exactly the first 32 elements in one half and the other half contains the second set of 32 elements. Could you please help me by suggesting how to use a customized partitioner such that I get two equally sized halves, maintaining the order of elements?

zero323
yh18190
  • Hi! Where are you calling partitionBy? I can't find this method in the Spark documentation. After I define a new partitioner, how do I partition an existing RDD into a new set of partitions? Thanks! – Max Song May 13 '14 at 19:58
  • `partitionBy` is in [PairRDDFunctions](http://spark.apache.org/docs/latest/api/core/index.html#org.apache.spark.rdd.PairRDDFunctions), so you can call it on any `RDD[(K, V)]`. There are a bunch of essential methods hidden in this class, check it out! – Daniel Darabos May 13 '14 at 22:05
  • Thanks Daniel! Will check it out for sure. – Max Song May 13 '14 at 22:41
  • Good question, I used to use `CoalescedRDD`, but they made it private in 1.0.0 – samthebest Aug 07 '14 at 20:30

3 Answers

26

Partitioners work by assigning a key to a partition. You would need prior knowledge of the key distribution, or look at all keys, to make such a partitioner. This is why Spark does not provide you with one.

In general you do not need such a partitioner. In fact I cannot come up with a use case where I would need equal-size partitions. What if the number of elements is odd?

Anyway, let us say you have an RDD keyed by sequential Ints, and you know how many in total. Then you could write a custom Partitioner like this:

import org.apache.spark.Partitioner

class ExactPartitioner[V](
    partitions: Int,
    elements: Int)
  extends Partitioner {

  def numPartitions: Int = partitions

  def getPartition(key: Any): Int = {
    val k = key.asInstanceOf[Int]
    // `k` is assumed to run contiguously from 0 to elements - 1.
    k * partitions / elements
  }
}
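
A minimal usage sketch, assuming `sc` is an existing SparkContext and the elements are already in the desired order (the variable names here are illustrative, not part of the answer): key each element by its global position with `zipWithIndex`, then repartition with `partitionBy`:

import org.apache.spark.rdd.RDD

// Hypothetical input: 64 ordered elements spread over 4 partitions.
val data: RDD[String] = sc.parallelize((1 to 64).map(i => s"element-$i"), 4)
val total = data.count().toInt

// Key each element by its global position (0 to total - 1), as the answer assumes.
val keyed: RDD[(Int, String)] = data
  .zipWithIndex()
  .map { case (value, index) => (index.toInt, value) }

// Two partitions: keys 0..31 land in partition 0, keys 32..63 in partition 1.
val halves = keyed.partitionBy(new ExactPartitioner[String](2, total))

Each resulting partition then holds a contiguous, equally sized slice of the original order.
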
Daniel Darabos
  • Thanks for the response, Daniel. It worked. I am working on an algorithm which has an even number of elements in the dataset. – yh18190 Apr 25 '14 at 07:33
  • 7
    Once you define this new class, where do you call it from? The partitioner in RDD is a val and I can't change it; if I define a new RDD with this custom Partitioner, how do I create it with a method? – Max Song May 13 '14 at 19:59
  • 1
    Skew can introduce additional processing time. It is introduced when one executor has more tasks pending than another, or when partitions are not equally sized (one task runs longer than another). I would generally say to overschedule (many more tasks than there are cores available) with smaller partitions, so the skew disappears into the noise. That is better than trying to exactly match tasks on a single executor. – YoYo Apr 28 '17 at 19:28
  • 7
    Note that there is another way to influence how you partition. By default it is using a [`HashPartitioner`](http://stackoverflow.com/q/31424396/744133), so by overriding your [`hashCode`](https://docs.oracle.com/javase/8/docs/api/java/lang/Object.html#hashCode--) method, you also directly influence the partitioning. – YoYo Apr 28 '17 at 19:35
  • @Daniel the solution does work. But how will the key be assigned to a partition when I have the key as '4', the number of partitions as '2', and the number of elements as '4'? Because with the given params, the partition ID will become 2 (4 * 2 / 4), and we only have 2 partitions, namely '0' and '1'? – DrthSprk_ Jan 15 '20 at 08:03
  • If you have 4 keys, they are 0, 1, 2, 3. – Daniel Darabos Jan 15 '20 at 09:55
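
Following up on the HashPartitioner comment above: a minimal sketch, using a hypothetical `PositionKey` class (an assumption, not from the thread), of how overriding `hashCode` directly steers which partition `HashPartitioner` picks:

import org.apache.spark.HashPartitioner

// Hypothetical key type: HashPartitioner assigns a record to partition
// (key.hashCode mod numPartitions), so a hashCode derived from the element's
// position controls which half it ends up in.
case class PositionKey(index: Int, total: Int, parts: Int) {
  // the first total/parts elements hash to 0, the next block to 1, and so on
  override def hashCode: Int = index * parts / total
}

// An RDD[(PositionKey, V)] repartitioned with
//   keyed.partitionBy(new HashPartitioner(2))
// would then send indices 0 .. total/2 - 1 to partition 0 and the rest to partition 1.
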
12

This answer takes some inspiration from Daniel's, but provides a full implementation (using the pimp-my-library pattern) with an example for people's copy-and-paste needs :)

import org.apache.spark.Partitioner
import org.apache.spark.SparkContext._  // pair RDD implicits (needed before Spark 1.3)
import org.apache.spark.rdd.RDD
import scala.util.Random

import RDDConversions._

trait RDDWrapper[T] {
  def rdd: RDD[T]
}

// TODO View bounds are deprecated, should use context bounds
// Might need to change ClassManifest for ClassTag in Spark 1.0.0
case class RichPairRDD[K <% Ordered[K] : ClassManifest, V: ClassManifest](
  rdd: RDD[(K, V)]) extends RDDWrapper[(K, V)] {
  // Here we use a single Long to try to ensure the sort is balanced,
  // but for a really large dataset we may want to consider
  // using a tuple of many Longs or even a GUID
  def sortByKeyGrouped(numPartitions: Int): RDD[(K, V)] =
    rdd.map(kv => ((kv._1, Random.nextLong()), kv._2)).sortByKey()
    .grouped(numPartitions).map(t => (t._1._1, t._2))
}

case class RichRDD[T: ClassManifest](rdd: RDD[T]) extends RDDWrapper[T] {
  def grouped(size: Int): RDD[T] = {
    // TODO Version where withIndex is cached
    // Number each element within its partition
    val withIndex = rdd.mapPartitions(_.zipWithIndex)

    // Per-partition element counts turned into cumulative start offsets;
    // the last entry of `startValues` is the total number of elements
    val startValues =
      withIndex.mapPartitionsWithIndex((i, iter) =>
        Iterator((i, iter.size))).collect().toList
      .sortBy(_._1).map(_._2.toLong).scan(0L)(_ + _)

    // Re-key every element by its global index, then slice the index range
    // into `size` equally sized, contiguous partitions
    withIndex.mapPartitionsWithIndex((i, iter) => iter.map {
      case (value, index) => (startValues(i) + index.toLong, value)
    })
    .partitionBy(new Partitioner {
      def numPartitions: Int = size
      def getPartition(key: Any): Int =
        (key.asInstanceOf[Long] * numPartitions.toLong / startValues.last).toInt
    })
    .map(_._2)
  }
}

Then in another file we have

import org.apache.spark.rdd.RDD

// TODO modify the above to be implicit classes, rather than implicit conversions
object RDDConversions {
  implicit def toRichRDD[T: ClassManifest](rdd: RDD[T]): RichRDD[T] =
    new RichRDD[T](rdd)
  implicit def toRichPairRDD[K <% Ordered[K] : ClassManifest, V: ClassManifest](
    rdd: RDD[(K, V)]): RichPairRDD[K, V] = RichPairRDD(rdd)
  implicit def toRDD[T](rdd: RDDWrapper[T]): RDD[T] = rdd.rdd
}

Then for your use case you just want (assuming it's already sorted)

import RDDConversions._

yourRdd.grouped(2)

Disclaimer: Not tested, kinda just wrote this straight into the SO answer

samthebest
  • Where is this "partitionBy" method? I see it only in JavaRDD, not in the Scala RDD. Update: OK, found it in PairRDDFunctions (included by implicits). – StephenBoesch Jan 07 '15 at 20:00
0

In newer versions of Spark you can write your own Partitioner and make use of the method zipWithIndex.

The idea is to

  • index your RDD
  • use the index as the key
  • apply a custom Partitioner based on the number of required partitions

Example code is shown below:

  // imports needed for the example
  import org.apache.spark.{Partitioner, TaskContext}
  import org.apache.spark.rdd.RDD

  // define custom Partitioner class
  class EqualDistributionPartitioner(numberOfPartitions: Int) extends Partitioner {
    override def numPartitions: Int = numberOfPartitions

    override def getPartition(key: Any): Int = {
      (key.asInstanceOf[Long] % numberOfPartitions).toInt
    }
  }

  // create test RDD (starting with one partition)
  val testDataRaw = Seq(
    ("field1_a", "field2_a"),
    ("field1_b", "field2_b"),
    ("field1_c", "field2_c"),
    ("field1_d", "field2_d"),
    ("field1_e", "field2_e"),
    ("field1_f", "field2_f"),
    ("field1_g", "field2_g"),
    ("field1_h", "field2_h"),
    ("field1_k", "field2_k"),
    ("field1_l", "field2_l"),
    ("field1_m", "field2_m"),
    ("field1_n", "field2_n")
  )
  val testRdd: RDD[(String, String)] = spark.sparkContext.parallelize(testDataRaw, 1)

  // create index
  val testRddWithIndex: RDD[(Long, (String, String))] = testRdd.zipWithIndex().map(msg => (msg._2, msg._1))

  // use the index for equal distribution
  // Example with two partitions
  println("Example with 2 partitions:")
  val equallyDistributedPartitionTwo = testRddWithIndex.partitionBy(new EqualDistributionPartitioner(2))
  equallyDistributedPartitionTwo.foreach(k => println(s"Partition: ${TaskContext.getPartitionId()}, Content: $k"))

  println("\nExample with 4 partitions:")
  // Example with four partitions
  val equallyDistributedPartitionFour = testRddWithIndex.partitionBy(new EqualDistributionPartitioner(4))
  equallyDistributedPartitionFour.foreach(k => println(s"Partition: ${TaskContext.getPartitionId()}, Content: $k"))

where spark is your SparkSession.

As output you will get:

Example with 2 partitions:
Partition: 0, Content: (0,(field1_a,field2_a))
Partition: 1, Content: (1,(field1_b,field2_b))
Partition: 0, Content: (2,(field1_c,field2_c))
Partition: 1, Content: (3,(field1_d,field2_d))
Partition: 0, Content: (4,(field1_e,field2_e))
Partition: 1, Content: (5,(field1_f,field2_f))
Partition: 0, Content: (6,(field1_g,field2_g))
Partition: 1, Content: (7,(field1_h,field2_h))
Partition: 0, Content: (8,(field1_k,field2_k))
Partition: 1, Content: (9,(field1_l,field2_l))
Partition: 0, Content: (10,(field1_m,field2_m))
Partition: 1, Content: (11,(field1_n,field2_n))

Example with 4 partitions:
Partition: 0, Content: (0,(field1_a,field2_a))
Partition: 0, Content: (4,(field1_e,field2_e))
Partition: 0, Content: (8,(field1_k,field2_k))
Partition: 3, Content: (3,(field1_d,field2_d))
Partition: 3, Content: (7,(field1_h,field2_h))
Partition: 3, Content: (11,(field1_n,field2_n))
Partition: 1, Content: (1,(field1_b,field2_b))
Partition: 1, Content: (5,(field1_f,field2_f))
Partition: 1, Content: (9,(field1_l,field2_l))
Partition: 2, Content: (2,(field1_c,field2_c))
Partition: 2, Content: (6,(field1_g,field2_g))
Partition: 2, Content: (10,(field1_m,field2_m))
mike