
I have a trait that takes a type parameter, and one of its methods needs to be able to create an empty typed dataset.

import org.apache.spark.sql.{Dataset, SparkSession}

trait MyTrait[T] {
    val sparkSession: SparkSession
    val sparkContext = sparkSession.sparkContext

    def createEmptyDataset(): Dataset[T] = {
        import sparkSession.implicits._ // to access the .toDS() method
        // DOESN'T WORK: fails to compile, see the errors below.
        val emptyRDD = sparkContext.parallelize(Seq[T]())
        val accumulator = emptyRDD.toDS()
        ...
    }
}

So far I have not gotten it to work. The compiler complains that there is no ClassTag available for T, and that value toDS is not a member of org.apache.spark.rdd.RDD[T].

Any help would be appreciated. Thanks!

pigate

1 Answer


You have to provide both ClassTag[T] and Encoder[T] in the same scope. For example:

import org.apache.spark.sql.{SparkSession, Dataset, Encoder}
import scala.reflect.ClassTag

trait MyTrait[T] {
    val ct: ClassTag[T]
    val enc: Encoder[T]

    val sparkSession: SparkSession
    val sparkContext = sparkSession.sparkContext

    def createEmptyDataset(): Dataset[T] = {
        val emptyRDD = sparkContext.emptyRDD[T](ct)
        sparkSession.createDataset(emptyRDD)(enc)
    }
}

with a concrete implementation:

class Foo extends MyTrait[Int] {
    val sparkSession = SparkSession.builder.getOrCreate()
    import sparkSession.implicits._ // brings the built-in encoders into scope

    val ct = implicitly[ClassTag[Int]]
    val enc = implicitly[Encoder[Int]]
}
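
For reference, here is a minimal sketch of calling the trait method through Foo; the local master, app name, and println are assumptions for illustration:

object Example {
    def main(args: Array[String]): Unit = {
        // Create the session first so Foo's getOrCreate() picks it up.
        val spark = SparkSession.builder
            .master("local[*]") // assumed local master for this sketch
            .appName("empty-dataset-example")
            .getOrCreate()

        val ds = new Foo().createEmptyDataset()
        println(ds.count()) // prints 0: the dataset is empty but correctly typed

        spark.stop()
    }
}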

It is also possible to skip the RDD entirely:

import org.apache.spark.sql.{SparkSession, Dataset, Encoder}

trait MyTrait[T] {
    val enc: Encoder[T]

    val sparkSession: SparkSession

    def createEmptyDataset(): Dataset[T] = {
        sparkSession.emptyDataset[T](enc)
    }
}
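
A concrete implementation would look much like Foo above; Bar and the String element type are assumptions for illustration:

class Bar extends MyTrait[String] {
    val sparkSession = SparkSession.builder.getOrCreate()
    import sparkSession.implicits._ // provides Encoder[String]

    val enc = implicitly[Encoder[String]]
}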

See How to declare traits as taking implicit "constructor parameters"?, specifically the answers by Blaisorblade and Alexey Romanov.
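
As a minimal sketch of the idea from those answers, assuming you can use an abstract class instead of a trait (MyClass, IntFoo, and the Encoders.scalaInt wiring are illustrative assumptions): unlike a trait, an abstract class can take implicit constructor parameters, so ct and enc are filled in automatically wherever the class is extended.

import org.apache.spark.sql.{SparkSession, Dataset, Encoder, Encoders}
import scala.reflect.ClassTag

abstract class MyClass[T](implicit val ct: ClassTag[T], val enc: Encoder[T]) {
    val sparkSession: SparkSession

    def createEmptyDataset(): Dataset[T] =
        sparkSession.createDataset(sparkSession.sparkContext.emptyRDD[T])
}

// Both implicits must be resolvable at the extension site:
object IntFooExample {
    implicit val intEnc: Encoder[Int] = Encoders.scalaInt
    class IntFoo(val sparkSession: SparkSession) extends MyClass[Int]
}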

Alper t. Turker