
I need to refresh an index managed by Solr 7.4. I use SolrJ to access it on a 64-bit Linux machine with 8 CPUs and 32GB of RAM (8GB of heap for the indexing application and 24GB for the Solr server). The index to be refreshed is around 800MB in size and contains around 36k documents (according to Luke).

Before starting the indexing process itself, I need to "clean" the index and remove the documents that no longer match an actual file on disk (e.g. a document was indexed previously and the file has moved since then, so the user would not be able to open it if it appeared on the results page).

To do so, I first need to get the list of documents in the index:

final SolrQuery query = new SolrQuery("*:*"); // Content fields are not loaded to reduce memory footprint
query.addField(PATH_DESCENDANT_FIELDNAME);
query.addField(PATH_SPLIT_FIELDNAME);
query.addField(MODIFIED_DATE_FIELDNAME);
query.addField(TYPE_OF_SCANNED_DOCUMENT_FIELDNAME);
query.addField("id");
query.setRows(Integer.MAX_VALUE); // we want ALL documents in the index, not only the first ones

SolrDocumentList results = this.getSolrClient()
                               .query(query)
                               .getResults(); // This line sometimes gives OOM
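
For context, once that list is retrieved, the cleaning itself boils down to something like the sketch below (hypothetical code: it assumes PATH_DESCENDANT_FIELDNAME stores the absolute path of the indexed file, which is not shown above):

// Hypothetical cleaning sketch, assuming PATH_DESCENDANT_FIELDNAME holds the file's absolute path.
// Needs: java.nio.file.Files, java.nio.file.Paths, java.util.ArrayList, java.util.List,
//        org.apache.solr.common.SolrDocument
final List<String> idsToDelete = new ArrayList<>();
for (final SolrDocument doc : results) {
    final String path = (String) doc.getFieldValue(PATH_DESCENDANT_FIELDNAME);
    if (path == null || !Files.exists(Paths.get(path))) {
        // The file no longer exists on disk: schedule the document for removal
        idsToDelete.add((String) doc.getFieldValue("id"));
    }
}
if (!idsToDelete.isEmpty()) {
    this.getSolrClient().deleteById(idsToDelete);
    this.getSolrClient().commit();
}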

When the OOM appears on the production machine, it occurs during that "index cleaning" part, and the stack trace reads:

Exception in thread "Timer-0" java.lang.OutOfMemoryError: Java heap space
at org.noggit.CharArr.resize(CharArr.java:110)
at org.noggit.CharArr.reserve(CharArr.java:116)
at org.apache.solr.common.util.ByteUtils.UTF8toUTF16(ByteUtils.java:68)
at org.apache.solr.common.util.JavaBinCodec.readStr(JavaBinCodec.java:868)
at org.apache.solr.common.util.JavaBinCodec.readStr(JavaBinCodec.java:857)
at org.apache.solr.common.util.JavaBinCodec.readObject(JavaBinCodec.java:266)
at org.apache.solr.common.util.JavaBinCodec.readVal(JavaBinCodec.java:256)
at org.apache.solr.common.util.JavaBinCodec.readSolrDocument(JavaBinCodec.java:541)
at org.apache.solr.common.util.JavaBinCodec.readObject(JavaBinCodec.java:305)
at org.apache.solr.common.util.JavaBinCodec.readVal(JavaBinCodec.java:256)
at org.apache.solr.common.util.JavaBinCodec.readArray(JavaBinCodec.java:747)
at org.apache.solr.common.util.JavaBinCodec.readObject(JavaBinCodec.java:272)
at org.apache.solr.common.util.JavaBinCodec.readVal(JavaBinCodec.java:256)
at org.apache.solr.common.util.JavaBinCodec.readSolrDocumentList(JavaBinCodec.java:555)
at org.apache.solr.common.util.JavaBinCodec.readObject(JavaBinCodec.java:307)
at org.apache.solr.common.util.JavaBinCodec.readVal(JavaBinCodec.java:256)
at org.apache.solr.common.util.JavaBinCodec.readOrderedMap(JavaBinCodec.java:200)
at org.apache.solr.common.util.JavaBinCodec.readObject(JavaBinCodec.java:274)
at org.apache.solr.common.util.JavaBinCodec.readVal(JavaBinCodec.java:256)
at org.apache.solr.common.util.JavaBinCodec.unmarshal(JavaBinCodec.java:178)
at org.apache.solr.client.solrj.impl.BinaryResponseParser.processResponse(BinaryResponseParser.java:50)
at org.apache.solr.client.solrj.impl.HttpSolrClient.executeMethod(HttpSolrClient.java:614)
at org.apache.solr.client.solrj.impl.HttpSolrClient.request(HttpSolrClient.java:255)
at org.apache.solr.client.solrj.impl.HttpSolrClient.request(HttpSolrClient.java:244)
at org.apache.solr.client.solrj.SolrRequest.process(SolrRequest.java:194)
at org.apache.solr.client.solrj.SolrClient.query(SolrClient.java:942)
at org.apache.solr.client.solrj.SolrClient.query(SolrClient.java:957)

I've already removed the content fields from the query because there were OOMs before, so I thought retrieving only "small" fields would avoid them, but the OOMs are still there. Moreover, when I started the project for the customer we had only 8GB of RAM (so a heap of 2GB), then we increased it to 20GB (heap of 5GB), and now to 32GB (heap of 8GB), and the OOM still appears, although the index is not that large compared to what is described in other SO questions (featuring millions of documents).

Please note that I cannot reproduce it on my less powerful dev machine (16GB of RAM, so a 4GB heap) after copying the 800MB index from the production machine.

So to me there could be a memory leak. That's why I followed the NetBeans post on memory leaks on my dev machine with the 800MB index. From what I see, I guess there is a memory leak, since run after run the number of surviving generations keeps increasing during the "index cleaning" (steep lines below):

[Graph: surviving generations on my project]

What should I do? 8GB of heap is already a huge amount compared to the index characteristics, so increasing the heap further does not seem to make sense, because the OOM only appears during the "index cleaning", not while actually indexing large documents, and it seems to be caused by the surviving generations, doesn't it? Would creating a new query object each time and then calling getResults on it help the garbage collector?

Is there another method to get all the document paths? Or maybe retrieving them chunk by chunk (pagination) would help, even for that small number of documents?
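
For instance, a plain start/rows pagination loop with SolrJ might look like the sketch below (hypothetical, untested code that reuses the field constants from the query above):

// Hypothetical start/rows pagination sketch: fetch the index in pages of 100 documents
// instead of one huge response.
final int pageSize = 100;
int start = 0;
SolrDocumentList page;
do {
    final SolrQuery pagedQuery = new SolrQuery("*:*");
    pagedQuery.addField(PATH_DESCENDANT_FIELDNAME);
    pagedQuery.addField("id");
    pagedQuery.setStart(start);
    pagedQuery.setRows(pageSize);
    page = this.getSolrClient().query(pagedQuery).getResults();
    // ... process the current page here ...
    start += page.size();
} while (start < page.getNumFound() && !page.isEmpty());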

Any help appreciated

  • The heap size is not necessarily (in fact, it's not) the same as the total amount of RAM available on your machine. You can adjust the heap size by giving the `-Xmx` option when running your application. Exactly how much depends on whether that property is set by default in your environment or if it's using the default size. See https://stackoverflow.com/a/13871564/137650 for how to find out what the current setting is. – MatsLindh Jun 09 '20 at 22:19
  • Thanks @MatsLindh, I forgot to mention the heap size (8GB). To avoid confusion I changed the title and removed the RAM term. I think I made a mistake somewhere which turns into a memory leak (I added a graph about it). I am very surprised to get an OOM here because I carefully avoid loading the large content fields (I don't add them to the query). It's not as though the OOM appeared during the processing of a very large document with millions of characters in it. – HelloWorld Jun 10 '20 at 05:31
  • If you attach a profiler you should be able to see exactly which areas of memory are being kept - it'll at least help narrow down the possible sources that can keep documents around for longer (and maybe help you discover why this happens on one machine and not the other). – MatsLindh Jun 10 '20 at 09:50
  • Ok, I'll try to improve my profiling skills (this is my first use of a profiler). This morning, after a reboot of the production machine, everything ran smoothly with around 8GB of RAM used (yesterday it was around 30GB, after months of run time without a reboot). – HelloWorld Jun 10 '20 at 12:41
  • Sounds like there might be an underlying memory leak somewhere in that case. Which version of Solr and the JVM are you running on? – MatsLindh Jun 10 '20 at 16:52
  • I am running Solr 7.4 with OpenJDK 11 (64-bit). – HelloWorld Jun 11 '20 at 12:28

1 Answer


After a while I finally came across this post. It exactly describes my issue:

An out of memory (OOM) error typically occurs after a query comes in with a large rows parameter. Solr will typically work just fine up until that query comes in.

So they advise (emphasis mine):

The rows parameter for Solr can be used to return more than the default of 10 rows. I have seen users successfully set the rows parameter to 100-200 and not see any issues. However, setting the rows parameter higher has a big memory consequence and should be avoided at all costs.

And this is what I see while retrieving 100 results per page:

[Graph: surviving generations with rows=100]

The number of surviving generations has decreased dramatically, although the garbage collector's activity is much more intensive and the computation time is much greater. But if this is the cost of avoiding OOM, this is OK (the program loses some seconds per index update, which can last several hours)!

Increasing the number of rows to 500 already makes the memory leak happen again (number of surviving generations increasing):

[Graph: surviving generations with rows=500]

Please note that setting the row number to 200 did not cause the number of surviving generations to increase much (I did not measure it precisely), but it did not perform much better in my test case (less than 2% faster) than the "100" setting:

[Graph: surviving generations with rows=200]

So here is the code I used to retrieve all documents from an index (from Solr's wiki):

SolrQuery q = (new SolrQuery(some_query)).setRows(r).setSort(SortClause.asc("id"));
String cursorMark = CursorMarkParams.CURSOR_MARK_START;
boolean done = false;
while (!done) {
    q.set(CursorMarkParams.CURSOR_MARK_PARAM, cursorMark);
    QueryResponse rsp = solrServer.query(q);
    String nextCursorMark = rsp.getNextCursorMark();
    doCustomProcessingOfResults(rsp);
    if (cursorMark.equals(nextCursorMark)) {
        done = true; // the cursor did not move: all documents have been read
    }
    cursorMark = nextCursorMark;
}
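
Note that cursor-based paging requires the sort to include the uniqueKey field (here id), which is why the query above sets SortClause.asc("id"). As for doCustomProcessingOfResults, a minimal hypothetical sketch for this "index cleaning" use case could look like the following (the field constant is the one from my question; the method body is my own placeholder, not from the wiki):

// Hypothetical per-page processing: read the stored path of each document so it can
// later be checked against the file system.
// Needs: org.apache.solr.client.solrj.response.QueryResponse, org.apache.solr.common.SolrDocument
private void doCustomProcessingOfResults(QueryResponse rsp) {
    for (SolrDocument doc : rsp.getResults()) {
        String path = (String) doc.getFieldValue(PATH_DESCENDANT_FIELDNAME);
        String id = (String) doc.getFieldValue("id");
        // ... check whether 'path' still exists on disk and remember 'id' for deletion if not
    }
}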

TL;DR: Don't use too large a number for query.setRows, i.e. not greater than 100-200, as a higher number is very likely to cause an OOM.
