
I have a huge time-series dataset on which I want to do reduceByKey. I am loading it with the joinWithCassandraTable API and then doing reduceByKey. The problem is that Spark loads all the data into memory and then does the reduceByKey. To reduce memory usage, I tried loading a small part first (say 4 samples out of 24) and doing reduceByKey on it, then loading the next part, combining it with the previous RDD, and doing reduceByKey again.

Sample Code:

import org.apache.spark.rdd.RDD

def loadData(t1: Long, t2: Long): RDD[(String, Long)] = ???  // joinWithCassandraTable for one time range
def ranges: List[(Long, Long)] = ???                         // the time ranges to load, in order

// Union the accumulated RDD with the newly loaded one and reduce again.
val combineRDD: (RDD[(String, Long)], RDD[(String, Long)]) => RDD[(String, Long)] = {
  case (x, y) => x.union(y).reduceByKey(_ + _)
}

// Fold over the ranges: load each chunk and combine it with the result so far.
ranges.aggregate(sc.emptyRDD[(String, Long)])(
  (acc, range) => combineRDD(acc, loadData(range._1, range._2)),
  combineRDD
)

But the above code does not work as expected: it creates a long lineage and still ends up loading the entire dataset.
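To make the lineage problem concrete, unrolling the fold over three ranges gives roughly the following nesting (t1..t4 are just illustrative range boundaries, loadData as above). Every union and reduceByKey stays in the lineage of the final RDD, so all loadData calls are replayed when the result is materialized:

// Illustrative unrolling of the fold over three consecutive ranges.
val (t1, t2, t3, t4) = (0L, 4L, 8L, 12L)  // illustrative boundaries only
val step1  = sc.emptyRDD[(String, Long)].union(loadData(t1, t2)).reduceByKey(_ + _)
val step2  = step1.union(loadData(t2, t3)).reduceByKey(_ + _)
val result = step2.union(loadData(t3, t4)).reduceByKey(_ + _)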

I also tried the solution proposed here using sc.union, but it does not help with the memory issue.
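Roughly what I tried based on that answer, as a sketch (loadData and ranges are the same placeholders as in the sample code above): build all the chunk RDDs first, union them in a single sc.union call, and reduce once at the end.

// Build one RDD per time range, then do a single union + reduceByKey.
val chunks: List[RDD[(String, Long)]] =
  ranges.map { case (t1, t2) => loadData(t1, t2) }

val result: RDD[(String, Long)] = sc.union(chunks).reduceByKey(_ + _)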

Reference: Stackoverflow due to long RDD Lineage

How can I merge each newly loaded RDD into the previously created RDD and then do reduceByKey without building up this lineage?

EDIT:

I tried using the localCheckpoint API as mentioned here:

Stackoverflow due to long RDD Lineage
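What I tried looks roughly like this (a sketch, with loadData and ranges as in the sample code above): after each combine step, mark the accumulated RDD for local checkpointing and force it with an action so the lineage stays short.

// Fold over the ranges, truncating the lineage after every combine step.
val result: RDD[(String, Long)] =
  ranges.foldLeft(sc.emptyRDD[(String, Long)]) { case (acc, (t1, t2)) =>
    val combined = acc.union(loadData(t1, t2)).reduceByKey(_ + _)
    combined.localCheckpoint()  // truncate the lineage using the caching layer
    combined.count()            // action to actually materialize the checkpoint
    combined
  }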

But it takes a very long time. Is there a better way?

Knight71

0 Answers