
I have the following code:

val rdd = sc.cassandraTable("db", "table").select("id", "date", "gpsdt").where("id=? and date=? and gpsdt>? and gpsdt<?", entry(0), entry(1), entry(2) , entry(3))
val rddcopy = rdd.sortBy(row => row.get[String]("gpsdt"), false).zipWithIndex()
rddcopy.foreach { records =>
  val previousRow = ??? // pseudocode: the (records - 1)th row -- this is what I'm asking how to get
  val currentRow = records
  // some calculation based on both rows
}

So the idea is to get the previous/next row on each iteration of the RDD. I want to calculate a field on the current row based on a value present on the previous row. Thanks,

jAi

1 Answer


EDIT II: I misunderstood the question; the code below gives tumbling-window semantics, but a sliding window is needed. Assuming this is a sorted RDD,

import org.apache.spark.mllib.rdd.RDDFunctions._
sortedRDD.sliding(2)

should do the trick. Note, however, that this uses a DeveloperApi.
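For the asker's data, a minimal sketch of how the windows might be consumed (the names previousRow/currentRow follow the question; which row is earlier in time depends on the sort direction chosen):

import org.apache.spark.mllib.rdd.RDDFunctions._

// Each window is an Array of two adjacent rows in the sort order.
sortedRDD.sliding(2).foreach { window =>
  val previousRow = window(0)
  val currentRow  = window(1)
  // some calculation based on both rows
}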

Alternatively, you can do:

val l = sortedRdd.zipWithIndex.map(kv => (kv._2, kv._1))     // (index, row)
val r = sortedRdd.zipWithIndex.map(kv => (kv._2 - 1, kv._1)) // (index - 1, row)
val sliding = l.join(r)                                      // (index, (row, next row))

RDD joins are inner joins (IIRC), so the edge cases where the tuples would be partially null are simply dropped.
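A minimal sketch of consuming the joined result (my reading of the key arithmetic: at key i the value holds row i and row i+1, i.e. each row paired with its successor):

// At key i the join yields (row i, row i+1); read the pair
// as (previousRow, currentRow) in the sort order.
sliding.foreach { case (_, (previousRow, currentRow)) =>
  // some calculation based on both rows
}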

OLD STUFF:

How do you identify the previous row? RDDs do not have any kind of stable ordering by themselves. If you have an incrementing dense key, you could add a new column calculated as `if (k % 2 == 0) k / 2 else (k - 1) / 2`; this gives a key that has the same value for two successive keys. Then you could just group by that key.
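As a quick sanity check of that key function (plain Scala, my own illustration):

// Successive indices collapse onto the same pair key.
val keys = (0 to 5).map(k => if (k % 2 == 0) k / 2 else (k - 1) / 2)
// keys == Vector(0, 0, 1, 1, 2, 2)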

But to reiterate: in most cases there is no really sensible notion of "previous" for RDDs (it depends on partitioning, the data source, etc.).

EDIT: So now that you have a zipWithIndex and an ordering in your set, you can do what I mentioned above. You now have an RDD[(Int, YourData)] and can do

rdd.map(kv => if (kv._1 % 2 == 0) (kv._1 / 2, kv._2) else ((kv._1 - 1) / 2, kv._2)).groupByKey.foreach(/* your stuff here */)

If you reduce at any point, consider using `reduceByKey` rather than `groupByKey().reduce`.
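For illustration, a hedged sketch of the difference on a toy sum (not the asker's actual calculation):

val pairs = sc.parallelize(Seq(("a", 1), ("a", 2), ("b", 3)))

// Preferred: reduceByKey combines values map-side before the shuffle.
val summed = pairs.reduceByKey(_ + _)

// Same result, but groupByKey ships every value across the network
// before reducing, which is more expensive.
val summedViaGroup = pairs.groupByKey().mapValues(_.sum)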

Dominic Egger
  • Hi, I've edited my question to get a sorted RDD. So now I hope it is clear what exactly I'm looking for – jAi Apr 09 '18 at 10:39
  • Cool. See the edit above on how it might work now. Just be aware that this is back-of-the-napkin stuff and I haven't verified it – Dominic Egger Apr 09 '18 at 10:44
  • Error: value % is not a member of com.datastax.spark.connector.CassandraRow – jAi Apr 09 '18 at 10:55
  • you'll probably have to do it on `rddcopy` rather than `rdd` – Dominic Egger Apr 09 '18 at 10:58
  • Yes, I tried on rddcopy only. For rdd it gives an error: value _1 is not a member of com.datastax.spark.connector.CassandraRow – jAi Apr 09 '18 at 11:05
  • Can you check the type of rddcopy? It might be `RDD[(CassandraRow, Int)]` rather than `RDD[(Int, CassandraRow)]`; if so, you'll have to adjust the code with the correct tuple usage – Dominic Egger Apr 09 '18 at 11:07
  • val rddcopy: RDD[(CassandraRow, Long)] – jAi Apr 09 '18 at 11:12
  • Ah ok, sorry; for some reason I thought `zipWithIndex` would append the `Long` index on the left instead of the right. Just switch around the tuple indices and make sure you return the index on the left in the `map`, because `groupByKey` assumes an `RDD[(K,V)]` and will group on `K`. So you'll want an `RDD[(Long, CassandraRow)]` before calling `groupByKey`, not the other way around – Dominic Egger Apr 09 '18 at 11:15
  • If you look at https://spark.apache.org/docs/latest/api/scala/index.html#org.apache.spark.rdd.PairRDDFunctions you'll see that after the `groupByKey` you have an `RDD[(K, Iterable[V])]`, so in your `foreach` you'll have tuples of `Long` and `Iterable[CassandraRow]`. You can work with those `Iterable`s as you would with any other; at that point they are executor-local, regular Scala collections – Dominic Egger Apr 09 '18 at 11:39
  • Let's say the rows are A,B,C,D,E (sorted order). I want (A,nopair),(B,A),(C,B),(D,C),(E,D), but your query gives the result as (A,nopair),(C,B),(E,D). I want successive grouping for each row mapped with its previous row. – jAi Apr 09 '18 at 12:08
  • I mentioned in my very first question that I need the previous row on each particular iteration – jAi Apr 09 '18 at 12:14
  • Let us [continue this discussion in chat](https://chat.stackoverflow.com/rooms/168560/discussion-between-aditya-jain-and-dominic-egger). – jAi Apr 09 '18 at 12:17