
What is the best practice for iterating through an RDD in Spark while accessing both the previous and the current element? Something like the reduce function, but returning an RDD instead of a single value.

For instance, given:

val rdd = spark.sparkContext.textFile("date_values.txt").
          map(_.split(",")).  // textFile yields lines, so split each one into fields first (adjust the delimiter as needed)
          map {
             case Array(val1, val2, val3) =>
                Element(DateTime.parse(val1), val2.toDouble)
          }

The output should be a new RDD containing the differences between consecutive val2 values:

Diff(date, current.val2 - previous.val2)

With the map function I can only access the current element, and with the reduce function I can only return a single value, not an RDD. I could use the foreach function and save the previous value in a temporary variable, but I don't think that would follow the Scala/Spark guidelines.

What do you think is the most appropriate way to handle this?

methk
  • See https://stackoverflow.com/questions/34146907/operate-on-neighbor-elements-in-rdd-in-spark – sachav Jan 15 '20 at 10:57

1 Answer


The answer given by Dominic Egger in this thread is what I was looking for:

Spark find previous value on each iteration of RDD

import org.apache.spark.mllib.rdd.RDDFunctions._
// each window is an Array of two consecutive elements: Array(previous, current)
sortedRDD.sliding(2)
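
The windows can then be mapped to the Diff records from the question. A minimal sketch, assuming Element has a date field and Diff is a case class shaped as in the question:

val diffs = sortedRDD.sliding(2).map {
  case Array(previous, current) =>
    // difference between consecutive val2 values, tagged with the current date
    Diff(current.date, current.val2 - previous.val2)
}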

or, using the Developer API:

// key each element by its index...
val l = sortedRdd.zipWithIndex.map(kv => (kv._2, kv._1))
// ...and by its index minus one, so the join pairs each element with its successor
val r = sortedRdd.zipWithIndex.map(kv => (kv._2 - 1, kv._1))
// sliding: RDD[(Long, (Element, Element))] whose values are (previous, current) pairs
val sliding = l.join(r)
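
The joined pairs can be mapped to the same Diff records. A minimal sketch, under the same assumptions about the Element and Diff case classes as above:

val diffs = sliding.map {
  case (_, (previous, current)) =>  // joined on index: previous at i, current at i + 1
    Diff(current.date, current.val2 - previous.val2)
}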
methk