2

I'm trying to implement a function that returns the intersection of two RDDs by comparing a given property.

  def intersect[T](left: RDD[Article], right: RDD[Article])(by: Article => (T,Article)) = {
    val a: RDD[(T, Article)] = left.map(by)
    val b: RDD[(T, Article)] = right.map(by)
    a.join(b).map { case (attr, (leftItem, rightItem)) => leftItem }
  }

However, compilation with sbt fails with the following error:

Error:(128, 7) value join is not a member of org.apache.spark.rdd.RDD[(T, org.example.Article)]
    a.join(b).map { case (attr, (leftItem, rightItem)) => leftItem }
      ^

If I hardcode the type instead of using T, everything works. Any idea why I get this error?

UPDATE

It seems that Scala is not able to apply the implicit conversion from RDD[(T, Article)] to PairRDDFunctions[T, Article], but I have no idea why.

UPDATE

If I modify the code like this:

  def intersect[T](left: RDD[Article], right: RDD[Article])(by: Article => (T,Article)) = {
    val a: PairRDDFunctions[T, Article] = left.map(by)
    val b: RDD[(T, Article)] = right.map(by)
    a.join(b).map { case (attr, (leftItem, rightItem)) => leftItem }
  }

I get another error:

[error]  No ClassTag available for T
[error]     val a: PairRDDFunctions[T, Article] = left.map(by)
Francis Toth

2 Answers

5

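Finally, I managed to solve this by using a ClassTag. As in Java, type parameters are erased at runtime, so Spark's implicit conversion from RDD[(K, V)] to PairRDDFunctions[K, V] requires an implicit ClassTag for both K and V. For a concrete type the compiler supplies one automatically, but for an abstract type parameter T it cannot, so the conversion is not applied and join is not found. Adding a ClassTag context bound to T (syntactic sugar for an extra implicit ClassTag[T] parameter) keeps the type information available at runtime: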

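  import org.apache.spark.rdd.RDD
  import scala.reflect.ClassTag

  // Key both sides by the chosen property, join on that key, keep the left-hand article.
  def intersect[T: ClassTag](left: RDD[Article], right: RDD[Article])(by: Article => T) = {
    val a: RDD[(T, Article)] = left.map(t => (by(t), t))
    val b: RDD[(T, Article)] = right.map(t => (by(t), t))
    a.join(b).map { case (attr, (leftItem, rightItem)) => leftItem }
  }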
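
To see why the bound matters, here is a minimal Spark-free sketch of the mechanism. `Coll`, `PairOps`, `toPairOps` and `Demo` are hypothetical stand-ins made up for illustration (loosely mirroring `RDD`, `PairRDDFunctions` and Spark's implicit conversion); they are not Spark classes.

```scala
import scala.language.implicitConversions
import scala.reflect.ClassTag

// Hypothetical stand-ins for RDD and PairRDDFunctions (not Spark's real classes).
class Coll[A](val items: Seq[A])

class PairOps[K, V](self: Coll[(K, V)]) {
  def keys: Seq[K] = self.items.map(_._1)
}

object Coll {
  // Like Spark's pair-RDD conversion, this one demands a ClassTag for K and V.
  implicit def toPairOps[K: ClassTag, V: ClassTag](c: Coll[(K, V)]): PairOps[K, V] =
    new PairOps(c)
}

object Demo {
  // Compiles: the compiler can always supply ClassTags for concrete types.
  def concrete(c: Coll[(String, Int)]): Seq[String] = c.keys

  // Would not compile: no ClassTag[T] is in scope, so the conversion is skipped
  // and `keys` is reported as not being a member of Coll[(T, Int)].
  // def genericNoTag[T](c: Coll[(T, Int)]): Seq[T] = c.keys

  // Compiles: the context bound asks the caller to pass the ClassTag along.
  def generic[T: ClassTag](c: Coll[(T, Int)]): Seq[T] = c.keys
}
```

Uncommenting `genericNoTag` should reproduce the same kind of "is not a member of" error as in the question, since the compiler silently discards a conversion whose implicit parameters it cannot satisfy.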

We can even put this function in an implicit class:

  implicit class RichRDD[T: ClassTag](rdd: RDD[T]) {
    def intersect[P: ClassTag](that: RDD[T])(by: T => P) = {
      val a: RDD[(P, T)] = rdd.map(t => (by(t), t))
      val b: RDD[(P, T)] = that.map(t => (by(t), t))
      a.join(b).map { case (attr, (leftItem, rightItem)) => leftItem }
    }
  }
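
A possible way to call the enriched method, as a rough sketch. It assumes Spark 1.3 or later on the classpath (where the pair-RDD implicits are found without an extra import; on older versions add `import org.apache.spark.SparkContext._`), a local master, and a simplified `Article` case class and `IntersectDemo` object that are made up for illustration.

```scala
import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.rdd.RDD
import scala.reflect.ClassTag

object IntersectDemo {
  // Hypothetical record type standing in for the question's Article.
  case class Article(id: Int, title: String)

  implicit class RichRDD[T: ClassTag](rdd: RDD[T]) {
    // Keeps the elements of `rdd` whose key (computed by `by`) also occurs in `that`.
    def intersect[P: ClassTag](that: RDD[T])(by: T => P): RDD[T] = {
      val a: RDD[(P, T)] = rdd.map(t => (by(t), t))
      val b: RDD[(P, T)] = that.map(t => (by(t), t))
      a.join(b).map { case (_, (leftItem, _)) => leftItem }
    }
  }

  // Intersects two small sample RDDs on the `id` property.
  def commonById(sc: SparkContext): Array[Article] = {
    val left  = sc.parallelize(Seq(Article(1, "a"), Article(2, "b")))
    val right = sc.parallelize(Seq(Article(2, "b2"), Article(3, "c")))
    left.intersect(right)(_.id).collect()
  }

  def main(args: Array[String]): Unit = {
    val conf = new SparkConf().setMaster("local[*]").setAppName("intersect-demo")
    val sc = new SparkContext(conf)
    try commonById(sc).foreach(println)
    finally sc.stop()
  }
}
```

Because the join pairs every left element with every right element sharing the key, a key that occurs several times on the right yields the left element several times; a `distinct()` afterwards may be wanted if set semantics are expected.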
Francis Toth
1

To complete the code snippet, the following import is also needed so that the implicit conversions for pair RDDs are in scope:

import org.apache.spark.SparkContext._

Alternatively you can write:

  def intersect[T](left: RDD[Article], right: RDD[Article])(by: Article => (T,Article))
                  (implicit kt: ClassTag[T]) = {
    ...
  }
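
The two spellings are interchangeable because a context bound is just shorthand for an implicit parameter. A small Spark-free sketch of that equivalence follows; the object and method names are made up for illustration.

```scala
import scala.reflect.ClassTag

object ClassTagForms {
  // Context-bound form: the compiler adds a hidden implicit ClassTag[T] parameter.
  def viaBound[T: ClassTag](items: Seq[T]): Array[T] = items.toArray

  // Explicit form: the same parameter, written out and given a name.
  def viaParam[T](items: Seq[T])(implicit kt: ClassTag[T]): Array[T] = items.toArray

  // The tag carries the erased class, which is what allows an Array[T] to be built.
  def runtimeClassName[T](implicit kt: ClassTag[T]): String = kt.runtimeClass.getName
}
```

The explicit form is mostly useful when the tag needs a name inside the method body, as `kt` does in `runtimeClassName`.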
skonto