
Hi, I've been using Jena for a project, and now I am trying to query a graph and store the results in plain files for batch processing with Hadoop.

I open a TDB Dataset and then query it in pages using LIMIT and OFFSET.

I write output files with 100,000 triples per file.

However, around the 10th file the performance degrades; by the 15th file it has dropped by a factor of 3, and by the 22nd file it is down to about 1%.

My query is:

SELECT DISTINCT ?S ?P ?O WHERE {?S ?P ?O .} LIMIT 100000 OFFSET X
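
For example, with size = 100000 and page = 3 (my code computes offset = size * page), the query that actually runs is:

SELECT DISTINCT ?S ?P ?O WHERE {?S ?P ?O .} LIMIT 100000 OFFSET 300000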

The method that queries and writes to a file is shown in the next code block:

public boolean copyGraphPage(int size, int page, String tdbPath, String query, String outputDir, String fileName) throws IllegalArgumentException {
        boolean retVal = true;
        if (size == 0) {
            throw new IllegalArgumentException("The size of the page should be bigger than 0");
        }
        long offset = ((long) size) * page;
        Dataset ds = TDBFactory.createDataset(tdbPath);
        ds.begin(ReadWrite.READ);
        String queryString = (new StringBuilder()).append(query).append(" LIMIT " + size + " OFFSET " + offset).toString();
        QueryExecution qExec = QueryExecutionFactory.create(queryString, ds);
        ResultSet resultSet = qExec.execSelect();
        List<String> resultVars;
        if (resultSet.hasNext()) {
            resultVars = resultSet.getResultVars();
            String fullyQualifiedPath = joinPath(outputDir, fileName, "txt");
            try (BufferedWriter bwr = new BufferedWriter(new OutputStreamWriter(new BufferedOutputStream(
                    new FileOutputStream(fullyQualifiedPath)), "UTF-8"))) {
                while (resultSet.hasNext()) {
                    QuerySolution next = resultSet.next();
                    StringBuffer sb = new StringBuffer();
                    sb.append(next.get(resultVars.get(0)).toString()).append(" ").
                            append(next.get(resultVars.get(1)).toString()).append(" ").
                            append(next.get(resultVars.get(2)).toString());
                    bwr.write(sb.toString());
                    bwr.newLine();
                }
                qExec.close();
                ds.end();
                ds.close();
                bwr.flush();
            } catch (IOException e) {
                e.printStackTrace();
            }
            resultVars = null;
            qExec = null;
            resultSet = null;
            ds = null;
        } else {
            retVal = false;
        }
        return retVal;
    }

The null assignments are there because I didn't know whether there was a possible leak somewhere in this method.

However, after the 22nd file the program fails with the following message:

java.lang.OutOfMemoryError: GC overhead limit exceeded

    at org.apache.jena.ext.com.google.common.cache.LocalCache$EntryFactory$2.newEntry(LocalCache.java:455)
    at org.apache.jena.ext.com.google.common.cache.LocalCache$Segment.newEntry(LocalCache.java:2144)
    at org.apache.jena.ext.com.google.common.cache.LocalCache$Segment.put(LocalCache.java:3010)
    at org.apache.jena.ext.com.google.common.cache.LocalCache.put(LocalCache.java:4365)
    at org.apache.jena.ext.com.google.common.cache.LocalCache$LocalManualCache.put(LocalCache.java:5077)
    at org.apache.jena.atlas.lib.cache.CacheGuava.put(CacheGuava.java:76)
    at org.apache.jena.tdb.store.nodetable.NodeTableCache.cacheUpdate(NodeTableCache.java:205)
    at org.apache.jena.tdb.store.nodetable.NodeTableCache._retrieveNodeByNodeId(NodeTableCache.java:129)
    at org.apache.jena.tdb.store.nodetable.NodeTableCache.getNodeForNodeId(NodeTableCache.java:82)
    at org.apache.jena.tdb.store.nodetable.NodeTableWrapper.getNodeForNodeId(NodeTableWrapper.java:50)
    at org.apache.jena.tdb.store.nodetable.NodeTableInline.getNodeForNodeId(NodeTableInline.java:67)
    at org.apache.jena.tdb.store.nodetable.NodeTableWrapper.getNodeForNodeId(NodeTableWrapper.java:50)
    at org.apache.jena.tdb.solver.BindingTDB.get1(BindingTDB.java:122)
    at org.apache.jena.sparql.engine.binding.BindingBase.get(BindingBase.java:121)
    at org.apache.jena.sparql.engine.binding.BindingProjectBase.get1(BindingProjectBase.java:52)
    at org.apache.jena.sparql.engine.binding.BindingBase.get(BindingBase.java:121)
    at org.apache.jena.sparql.engine.binding.BindingProjectBase.get1(BindingProjectBase.java:52)
    at org.apache.jena.sparql.engine.binding.BindingBase.get(BindingBase.java:121)
    at org.apache.jena.sparql.engine.binding.BindingBase.hashCode(BindingBase.java:201)
    at org.apache.jena.sparql.engine.binding.BindingBase.hashCode(BindingBase.java:183)
    at java.util.HashMap.hash(HashMap.java:338)
    at java.util.HashMap.containsKey(HashMap.java:595)
    at java.util.HashSet.contains(HashSet.java:203)
    at org.apache.jena.sparql.engine.iterator.QueryIterDistinct.getInputNextUnseen(QueryIterDistinct.java:106)
    at org.apache.jena.sparql.engine.iterator.QueryIterDistinct.hasNextBinding(QueryIterDistinct.java:70)
    at org.apache.jena.sparql.engine.iterator.QueryIteratorBase.hasNext(QueryIteratorBase.java:114)
    at org.apache.jena.sparql.engine.iterator.QueryIterSlice.hasNextBinding(QueryIterSlice.java:76)
    at org.apache.jena.sparql.engine.iterator.QueryIteratorBase.hasNext(QueryIteratorBase.java:114)
    at org.apache.jena.sparql.engine.iterator.QueryIteratorWrapper.hasNextBinding(QueryIteratorWrapper.java:39)
    at org.apache.jena.sparql.engine.iterator.QueryIteratorBase.hasNext(QueryIteratorBase.java:114)
    at org.apache.jena.sparql.engine.iterator.QueryIteratorWrapper.hasNextBinding(QueryIteratorWrapper.java:39)
    at org.apache.jena.sparql.engine.iterator.QueryIteratorBase.hasNext(QueryIteratorBase.java:114)

Disconnected from the target VM, address: '127.0.0.1:57723', transport: 'socket'

Process finished with exit code 255

The memory viewer shows an increase in memory usage after each page query:


It is clear that Jena's LocalCache is filling up. I have changed Xmx to 2048m and Xms to 512m with the same result; nothing changed.

Do I need more memory?

Do I need to clear something?

Do I need to stop the program and do it in parts?

Is my query wrong?

Does the OFFSET have anything to do with it?

I read in some old mailing list posts that you can turn the cache off, but I could not find any way to do it. Is there a way to turn the cache off?

I know it is a very difficult question but I appreciate any help.

Nord
  • I think you should consider rewriting this. See the guidelines: https://stackoverflow.com/help/how-to-ask – Kamran Sep 01 '17 at 01:51
  • `2048m` is not that much nowadays. Why can't you simply increase it? And which Jena version do you use? – UninformedUser Sep 01 '17 at 06:14
  • @Nord, just a minor comment: perhaps you do not need `DISTINCT`. – Stanislav Kralin Sep 01 '17 at 07:02
  • @AKSW I used versions 2.14 and 3.4; the messages are a little different in the two versions, but the same error appears. The messages in the question are from version 3.4. I dropped the DISTINCT, increased memory, and I'm already on page 200. Thanks for the help. – Nord Sep 01 '17 at 20:02
  • Thanks @StanislavKralin, I did drop the DISTINCT. I do not know whether the graph has any duplicates, but I believe it is not worth the cost. Since I will analyze the graph with Hadoop, I will check for duplicates at that step. – Nord Sep 01 '17 at 20:28
  • @Nord, a graph is a *set* of triples. By definition, a graph doesn't have duplicates. But to eliminate duplicates from results, you have to know what's already been output, and that could be expensive. – Joshua Taylor Sep 06 '17 at 12:01

2 Answers


It is clear that Jena LocalCache is filling up

This is the TDB node cache - it usually needs 1.5G (2G is better) per dataset itself. This cache persists for the lifetime of the JVM.

A Java heap of 2G is small by today's standards. If you must use a small heap, you can try running in 32-bit mode (called "Direct mode" in TDB), but this is less performant (mainly because the node cache is smaller, and in this dataset you do have enough nodes to cause cache churn with a small cache).

The node cache is the main cause of the heap exhaustion but the query is consuming memory elsewhere, per query, in DISTINCT.

DISTINCT is not necessarily cheap. It needs to remember everything it has seen to know whether a new row is the first occurrence or already seen.

Apache Jena does optimize some cases of DISTINCT combined with LIMIT (as a TopN query), but the cutoff for the optimization is 1000 by default. See OpTopN in the code.

Otherwise it is collecting all the rows seen so far. The further through the dataset you go, the more that is in the node cache and also the more that is in the DISTINCT filter.
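
To picture what the DISTINCT filter is doing, here is a minimal, conceptual sketch (not Jena's actual QueryIterDistinct, though the stack trace above shows it keeping a HashSet of bindings in the same spirit): every distinct row emitted so far must stay in memory so later rows can be checked against it.

    import java.util.ArrayList;
    import java.util.HashSet;
    import java.util.List;
    import java.util.Set;

    // Conceptual model of a streaming DISTINCT: a row is emitted only the first
    // time it is seen, which forces the filter to remember every emitted row.
    class DistinctFilter {
        private final Set<List<String>> seen = new HashSet<>();

        // Returns true if the row has not been seen before; the row is then
        // retained for the rest of the query, so memory grows with each new row.
        boolean isNew(List<String> row) {
            return seen.add(new ArrayList<>(row));
        }
    }

Note that OFFSET does not avoid this: the slice is applied after DISTINCT (QueryIterSlice wraps QueryIterDistinct in the stack trace), so a page at OFFSET X still makes DISTINCT process and remember roughly X + LIMIT rows.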

Do I need more memory?

Yes, more heap. The sensible minimum is 2G per TDB dataset, plus whatever Java itself requires (say, 0.5G), plus your program and query workspace.
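
For example, a larger heap can be set with the standard JVM flags when launching the export (the jar and class names below are only placeholders):

    java -Xms1g -Xmx4g -cp exporter.jar com.example.GraphExporter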

AndyS
  • Thanks, I saw something similar to what you are describing: when I queried page 21 it started to slow down. Perhaps I'll drop the DISTINCT, since after the Hadoop mapper the reduce step will allow me to detect duplicates. I'll try your solution with 4GB and 6GB, and later 12GB when I get to my main computer. – Nord Sep 01 '17 at 19:18
  • I changed the memory to 4G and dropped the `DISTINCT` and it runs fine. I'll check for duplicates at the MapReduce step. I thought 2GB was enough for any TDB-related task; I didn't find it in the documentation. Thanks. – Nord Sep 01 '17 at 19:45

You seem to have a memory leak somewhere. This is just a guess, but try this:

TDBFactory.release(ds);

REF: https://jena.apache.org/documentation/javadoc/tdb/org/apache/jena/tdb/TDBFactory.html#release-org.apache.jena.query.Dataset-
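
As a sketch of where that call could go in the `copyGraphPage` method above (the placement is a suggestion, not something verified against your setup), release the dataset once the read transaction has ended:

    qExec.close();
    ds.end();
    ds.close();
    // Intended to drop TDB's cached state for this dataset location;
    // it is rebuilt on the next TDBFactory.createDataset for that path.
    TDBFactory.release(ds);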

Irwan Hendra
  • This is unlikely to help in this case, as the node cache will fill up again. This is the same query each time, going different lengths through the database in the same order. The node cache is going to fill up in the same way on the longer query and explode at about the same point. – AndyS Sep 01 '17 at 09:48