I have a huge time-series dataset on which I want to do reduceByKey. I am loading it with the joinWithCassandraTable API and then doing reduceByKey. The problem is that Spark loads all the data into memory and then does the reduceByKey. To cut memory usage, I tried loading a small slice (say 4 samples out of 24) and doing reduceByKey on it, then loading the next slice, merging it with the previous RDD, and doing reduceByKey again.
Sample Code:
    import org.apache.spark.rdd.RDD

    def loadData(t1: Long, t2: Long): RDD[(String, Long)] = ???
    def ranges: List[(Long, Long)] = ???

    // Union two RDDs and collapse duplicate keys
    val combineRDD: (RDD[(String, Long)], RDD[(String, Long)]) => RDD[(String, Long)] = {
      case (x, y) => x.union(y).reduceByKey(_ + _)
    }

    ranges.aggregate(sc.emptyRDD[(String, Long)])(
      (acc, r) => combineRDD(acc, loadData(r._1, r._2)),
      combineRDD
    )
But the above code does not work as expected: it builds a long lineage and still ends up loading the entire dataset.
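Concretely, the fold above expands into something like the following (a sketch for the first three entries of ranges, calling them r1, r2, r3; these names are just for illustration), which is where the ever-growing DAG comes from:

    // Hypothetical expansion for the first three ranges r1, r2, r3:
    // each step layers another union + reduceByKey on top of the
    // previous RDD, so the DAG keeps growing with every iteration.
    val List(r1, r2, r3) = ranges.take(3)
    val step1 = combineRDD(sc.emptyRDD[(String, Long)], loadData(r1._1, r1._2))
    val step2 = combineRDD(step1, loadData(r2._1, r2._2))
    val step3 = combineRDD(step2, loadData(r3._1, r3._2))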
I tried the solution proposed here (sc.union), but it doesn't help with the memory issue.
Reference: Stackoverflow due to long RDD Lineage
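For reference, the sc.union variant I tried looked roughly like this (a sketch reusing the loadData and ranges placeholders from above): it flattens the chain of pairwise unions into a single UnionRDD, so the lineage stays shallow, but every slice still has to be loaded before the one reduceByKey runs, which is why memory doesn't improve.

    // Sketch: flat union of all slices, then a single reduceByKey.
    // Lineage stays shallow (one UnionRDD), but Spark still pulls
    // every time range in before the shuffle.
    val all: RDD[(String, Long)] =
      sc.union(ranges.map { case (t1, t2) => loadData(t1, t2) })
    val reduced: RDD[(String, Long)] = all.reduceByKey(_ + _)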
How can I merge each newly loaded RDD into the previously created RDD and then do reduceByKey?
EDIT:
I tried using the localCheckpoint API as mentioned here:
Stackoverflow due to long RDD Lineage
But it is taking a very long time. Is there any better way?
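For completeness, this is roughly how I wired localCheckpoint into the combine step (a sketch; the count() is just one way to force an action so the RDD actually materializes and the lineage gets cut):

    // Sketch: truncate the lineage after every merge by locally
    // checkpointing the intermediate RDD and forcing it to materialize.
    val combineRDD: (RDD[(String, Long)], RDD[(String, Long)]) => RDD[(String, Long)] = {
      case (x, y) =>
        val merged = x.union(y).reduceByKey(_ + _).localCheckpoint()
        merged.count() // action that materializes the checkpoint
        merged
    }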