
Having looked at "What happens if an RDD can't fit into memory in Spark?" and "Storagelevel in spark RDD MEMORY_AND_DISK_2() throw exception", I still have a question:

Is "storage" on the Executor tab tracking data in a RDD as it's built? Or only once that RDD is persisted to memory or disk per the cache settings? If you look at the screenshot lower in this post, there are about 100,000 objects built but it shows zero memory being used.

Want more detail? I build an RDD called "dataSetN", which is simply an RDD of integers of the desired size (typically 10^6), then map that to an RDD of our "SparkDoDrop" results:

JavaRDD<TholdDropResult> dropResultsN = dataSetN.map(s -> new SparkDoDrop(parFile, pedFile).call(s)).persist(StorageLevel.MEMORY_ONLY());
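
(For context, here is roughly how dataSetN gets built before that map; a simplified sketch only, where the variable names, N, and the slice count are placeholders rather than the real code:)

import java.util.ArrayList;
import java.util.List;
import org.apache.spark.api.java.JavaRDD;
import org.apache.spark.api.java.JavaSparkContext;

// Sketch: build the integer RDD that feeds the map() above.
// N and numSlices are illustrative; jsc is the existing JavaSparkContext.
int N = 1_000_000;      // typically 10^6 in our runs
int numSlices = 16;     // partition count, made up for illustration

List<Integer> seeds = new ArrayList<>(N);
for (int i = 0; i < N; i++) {
    seeds.add(i);
}
JavaRDD<Integer> dataSetN = jsc.parallelize(seeds, numSlices);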

I'm only posting selections of the code, but we do actually do something with the "dropResultsN" RDD members, so I know Spark is not just being lazy:

        // Each pass adds every drop result to one of the Accumulables in the list
        for (Accumulable<TholdDropTuple, TholdDropResult> dropEvalAccum : dropAccumList) {
            dropResultsN.foreach(new VoidFunction<TholdDropResult>() {
                @Override
                public void call(TholdDropResult dropResultFromN) throws Exception {
                    dropEvalAccum.add(dropResultFromN);
                }
            });
        }

When looking at the Executors tab on the Application Detail page: [screenshot: Executors tab showing zero storage memory used]

At the point this screen was shot, over 100,000 Java objects had been built and stored in an RDD, which I know because I log a line every time 1,000 objects are created.

Finally, when I upped the (single, for testing) executor to 17.5 GB it almost made it to 100K before I got the messages below. In the previous run with 12 GB it kept going back and recomputing what it had lost, and it looked like it would never finish:

16/06/19 19:47:59 INFO SparkDoDrop: Hit a 100 milestone on N = 90900 
16/06/19 19:48:06 INFO MemoryStore: Will not store rdd_1_1 as it would require dropping another block from the same RDD 
16/06/19 19:48:06 WARN MemoryStore: Not enough space to cache rdd_1_1 in memory! (computed 5.2 GB so far) 
16/06/19 19:48:06 INFO MemoryStore: Memory use = 6.4 GB (blocks) + 5.2 GB (scratch space shared across 2 tasks(s)) = 11.6 GB. Storage limit = 12.1 GB. 
16/06/19 19:48:14 INFO SparkDoDrop: Hit a 100 milestone on N = 91000
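
(For reference, these are the knobs I've been turning between runs; a hedged sketch with made-up values, using the standard Spark 1.6 config keys, not a recommendation:)

import org.apache.spark.SparkConf;

// Sketch: the memory settings being varied between runs.
SparkConf conf = new SparkConf()
        .setAppName("SparkDoDropTest")                 // placeholder app name
        .set("spark.executor.memory", "17g")           // was 12g in the earlier run
        .set("spark.memory.fraction", "0.75")          // unified memory pool (Spark 1.6+)
        .set("spark.memory.storageFraction", "0.5");   // share of that pool protected for cached blocks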

My next idea, which I will test tomorrow, is to turn off caching/persist entirely, something like the sketch below. Thanks for any input!
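
(A sketch of what I mean, not tested yet: either drop the persist() call entirely, or keep the cache but let blocks spill to disk instead of being dropped.)

import org.apache.spark.storage.StorageLevel;

// Option 1: no explicit caching at all -- just drop the persist() call.
JavaRDD<TholdDropResult> dropResultsN =
        dataSetN.map(s -> new SparkDoDrop(parFile, pedFile).call(s));

// Option 2: keep the cache but allow blocks to spill rather than be dropped:
// .persist(StorageLevel.MEMORY_AND_DISK());

// If anything was persisted, free it once the accumulator loop is done:
dropResultsN.unpersist();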

  • BTW, I acknowledge I am doing something weird by creating an array of Accumulables (n = 1000 to 2000 for our application). I intend to rewrite this in Scala in my spare time (hahaha) because Scala's SparkContext allows a HashMap Accumulable while JavaSparkContext does not. All the same, I don't think it wastes too much memory, but it's ugly, I agree :) – JimLohse Jun 20 '16 at 01:52
  • And I do see http://stackoverflow.com/questions/26562033/how-to-set-apache-spark-executor-memory and http://stackoverflow.com/questions/34114625/spark-not-enough-space-to-cache-red-in-container-while-still-a-lot-of-total-sto?rq=1, but they don't address my specific question: what is the meaning of the memory usage shown in my screenshot of the Executor tab on the Application Detail page? – JimLohse Jun 20 '16 at 02:10
