I have a huge time-series dataset on which I want to do reduceByKey. I am loading it with the joinWithCassandraTable API and then doing reduceByKey. The problem is that Spark loads all the data into memory and then does the reduceByKey. To cut memory usage, I tried loading a small slice (say 4 samples out of 24) and doing reduceByKey on it, then loading the next slice, merging it with the previous RDD, and doing reduceByKey again.
Sample Code:
    import org.apache.spark.rdd.RDD

    def loadData(t1: Long, t2: Long): RDD[(String, Long)] = ???
    def ranges: List[(Long, Long)] = ???

    // Union two RDDs and collapse duplicate keys
    val combineRDD: (RDD[(String, Long)], RDD[(String, Long)]) => RDD[(String, Long)] = {
      case (x, y) => x.union(y).reduceByKey(_ + _)
    }

    ranges.aggregate(sc.emptyRDD[(String, Long)])(
      (acc, r) => combineRDD(acc, loadData(r._1, r._2)),
      combineRDD
    )
But the above code does not work as expected: it builds a long lineage and still ends up loading the entire dataset.
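Concretely, the fold above expands into something like the following (a sketch for the first three entries of ranges, calling them r1, r2, r3; these names are just for illustration), which is where the ever-growing DAG comes from:

    // Hypothetical expansion for the first three ranges r1, r2, r3:
    // each step layers another union + reduceByKey on top of the
    // previous RDD, so the DAG keeps growing with every iteration.
    val List(r1, r2, r3) = ranges.take(3)
    val step1 = combineRDD(sc.emptyRDD[(String, Long)], loadData(r1._1, r1._2))
    val step2 = combineRDD(step1, loadData(r2._1, r2._2))
    val step3 = combineRDD(step2, loadData(r3._1, r3._2))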
I tried the solution proposed here (sc.union), but it doesn't help with the memory issue.
Reference: Stackoverflow due to long RDD Lineage
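For reference, the sc.union variant I tried looked roughly like this (a sketch reusing the loadData and ranges placeholders from above): it flattens the chain of pairwise unions into a single UnionRDD, so the lineage stays shallow, but every slice still has to be loaded before the one reduceByKey runs, which is why memory doesn't improve.

    // Sketch: flat union of all slices, then a single reduceByKey.
    // Lineage stays shallow (one UnionRDD), but Spark still pulls
    // every time range in before the shuffle.
    val all: RDD[(String, Long)] =
      sc.union(ranges.map { case (t1, t2) => loadData(t1, t2) })
    val reduced: RDD[(String, Long)] = all.reduceByKey(_ + _)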
How can I merge each newly loaded RDD into the previously created RDD and then do reduceByKey?
EDIT:
I tried using the localCheckpoint API as mentioned here:
Stackoverflow due to long RDD Lineage
But it is taking a very long time. Is there any better way?
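For completeness, this is roughly how I wired localCheckpoint into the combine step (a sketch; the count() is just one way to force an action so the RDD actually materializes and the lineage gets cut):

    // Sketch: truncate the lineage after every merge by locally
    // checkpointing the intermediate RDD and forcing it to materialize.
    val combineRDD: (RDD[(String, Long)], RDD[(String, Long)]) => RDD[(String, Long)] = {
      case (x, y) =>
        val merged = x.union(y).reduceByKey(_ + _).localCheckpoint()
        merged.count() // action that materializes the checkpoint
        merged
    }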