We need to efficiently convert large lists of key/value pairs, like this:

val providedData = List(
  (new Key("1"), new Val("one")),
  (new Key("1"), new Val("un")),
  (new Key("1"), new Val("ein")),
  (new Key("2"), new Val("two")),
  (new Key("2"), new Val("deux")),
  (new Key("2"), new Val("zwei"))
)

into lists of values per key, like this:

val expectedData = List(
  (new Key("1"), List(
    new Val("one"), 
    new Val("un"), 
    new Val("ein"))),
  (new Key("2"), List(
    new Val("two"), 
    new Val("deux"), 
    new Val("zwei")))
)

The key/value pairs come from a large key/value store (Accumulo), so the keys are sorted, but they will usually cross Spark partition boundaries. There can be millions of keys and hundreds of values per key.

I think the right tool for this job is Spark's combineByKey operation, but I have only been able to find terse examples with primitive types (like Int) that I've been unable to generalize to user-defined types such as those above.

Since I suspect many others will have the same question, I'm hoping someone can provide both fully-specified (verbose) and terse examples of the Scala syntax for using combineByKey with user-defined types as above, or else point out a better tool that I've missed.
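
For reference, Key and Val can be thought of as simple wrappers along these lines (a minimal sketch; any classes with value-based equals and hashCode, such as case classes, should behave the same, since Spark uses those methods to group keys across partitions):

case class Key(k: String)
case class Val(v: String)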


1 Answer

I'm not really a Spark expert, but based on this question, I think you can do the following:

val rdd = sc.parallelize(providedData)

rdd.combineByKey(
    // createCombiner: start a list from the first value for a key
    (x: Val) => List(x),
    // mergeValue: prepend a new value to the existing list
    (acc: List[Val], x: Val) => x :: acc,
    // mergeCombiners: combine the two lists
    (acc1: List[Val], acc2: List[Val]) => acc1 ::: acc2
)
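
For the fully-specified version the question asks for, here is the same call with the type parameter and every argument type written out (a sketch; the grouped name and the RDD import are additions for illustration):

import org.apache.spark.rdd.RDD

val grouped: RDD[(Key, List[Val])] = rdd.combineByKey[List[Val]](
    // createCombiner: Val => List[Val]
    (x: Val) => List(x),
    // mergeValue: (List[Val], Val) => List[Val]
    (acc: List[Val], x: Val) => x :: acc,
    // mergeCombiners: (List[Val], List[Val]) => List[Val]
    (acc1: List[Val], acc2: List[Val]) => acc1 ::: acc2
)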

Using aggregateByKey:

rdd.aggregateByKey(List[Val]())(
    // seqOp: prepend a new value to this partition's accumulator
    (acc, x) => x :: acc,
    // combOp: combine the per-partition lists
    (acc1, acc2) => acc1 ::: acc2
)
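
To run either version end to end (a sketch; since the lists are built by prepending with ::, the order of values within each list depends on input order and partitioning):

val result = rdd.aggregateByKey(List[Val]())(
    (acc, x) => x :: acc,
    (acc1, acc2) => acc1 ::: acc2
)

result.collect().foreach(println)
// with the case classes sketched above, prints something like:
// (Key(1),List(Val(ein), Val(un), Val(one)))
// (Key(2),List(Val(zwei), Val(deux), Val(two)))
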
  • Hmmm; I'm getting this: scala> r.collect res4: Array[(Key, List[Val])] = Array((1,List(ein)), (1,List(one)), (1,List(un)), (2,List(deux)), (2,List(two)), (2,List(zwei))) – Bradjcox Jun 20 '15 at 14:13
  • @Bradjcox Have you implemented `Key` as a case class or a normal class? In the latter instance, you should override the `equals` method. Try `case class Key(key: String)` – Peter Neyens Jun 20 '15 at 14:27
  • Here is the Key class I'm using: @SerialVersionUID(123L) case class Key(v: String) extends Serializable { val n = v; override def toString: String = { return n; } override def equals(o: Any) = o match { case that: Key => that.n.equals(this.n) case _ => false } override def hashCode = n.hashCode } – Bradjcox Jun 20 '15 at 14:39
  • Sorry for the false alarm. Its working fine as compiled code, just not in the REPL. I'm good now. Thanks! – Bradjcox Jun 20 '15 at 16:36
  • In this case, an `aggregateByKey` would work. No need to go with the more complex signature. – Justin Pihony Jun 21 '15 at 05:42
  • Hmmm, the new code doesn't compile in Eclipse: type mismatch; found: List[Val], required: Nil.type. And how would it acquire a List to extend? How is aggregateByKey better than combineByKey? – Bradjcox Jun 21 '15 at 10:12
  • @Bradjcox I have updated my answer. `aggregateByKey` actually uses `combineByKey` [underneath](https://github.com/apache/spark/blob/master/core/src/main/scala/org/apache/spark/rdd/PairRDDFunctions.scala#L124-137). – Peter Neyens Jun 21 '15 at 10:53
  • Something's still wrong with the aggregateByKey example, but no biggie; I'm good with combineByKey: (1,one) (2,deux) (2,zwei) (1,un) (1,ein) (2,two) – Bradjcox Jun 21 '15 at 12:17