Hi, I've been using Jena for a project, and now I am trying to query a graph and store the results in plain files for batch processing with Hadoop.
I open a TDB Dataset and then query it page by page with LIMIT and OFFSET.
I write 100000 triples to each output file.
However, performance starts to degrade at the 10th file, drops by a factor of 3 at the 15th file, and is down to 1% at the 22nd file.
My query is:
SELECT DISTINCT ?S ?P ?O WHERE {?S ?P ?O .} LIMIT 100000 OFFSET X
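To make the paging concrete, here is a minimal sketch of how X is derived for each page. It only mirrors the offset arithmetic used in my method; the class name `PagedQueries` and the helper `pagedQuery` are hypothetical names introduced for illustration, and no Jena classes are involved.

```java
public class PagedQueries {
    // Hypothetical helper: appends LIMIT/OFFSET for a zero-based page index.
    public static String pagedQuery(String baseQuery, int size, int page) {
        if (size <= 0) {
            throw new IllegalArgumentException("The size of the page should be bigger than 0");
        }
        // Widen to long before multiplying so large page numbers do not overflow int.
        long offset = (long) size * page;
        return baseQuery + " LIMIT " + size + " OFFSET " + offset;
    }

    public static void main(String[] args) {
        String base = "SELECT DISTINCT ?S ?P ?O WHERE {?S ?P ?O .}";
        // Prints the queries for pages 0, 1 and 2 (offsets 0, 100000 and 200000).
        for (int page = 0; page < 3; page++) {
            System.out.println(pagedQuery(base, 100000, page));
        }
    }
}
```

So the 22nd file (page index 21) is requested with OFFSET 2100000.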
The method that queries and writes to a file is shown in the next code block:
public boolean copyGraphPage(int size, int page, String tdbPath, String query, String outputDir, String fileName) throws IllegalArgumentException {
    boolean retVal = true;
    if (size == 0) {
        throw new IllegalArgumentException("The size of the page should be bigger than 0");
    }
    long offset = ((long) size) * page;
    Dataset ds = TDBFactory.createDataset(tdbPath);
    ds.begin(ReadWrite.READ);
    String queryString = (new StringBuilder()).append(query).append(" LIMIT " + size + " OFFSET " + offset).toString();
    QueryExecution qExec = QueryExecutionFactory.create(queryString, ds);
    ResultSet resultSet = qExec.execSelect();
    List<String> resultVars;
    if (resultSet.hasNext()) {
        resultVars = resultSet.getResultVars();
        String fullyQualifiedPath = joinPath(outputDir, fileName, "txt");
        try (BufferedWriter bwr = new BufferedWriter(new OutputStreamWriter(new BufferedOutputStream(
                new FileOutputStream(fullyQualifiedPath)), "UTF-8"))) {
            while (resultSet.hasNext()) {
                QuerySolution next = resultSet.next();
                StringBuffer sb = new StringBuffer();
                sb.append(next.get(resultVars.get(0)).toString()).append(" ").
                        append(next.get(resultVars.get(1)).toString()).append(" ").
                        append(next.get(resultVars.get(2)).toString());
                bwr.write(sb.toString());
                bwr.newLine();
            }
            qExec.close();
            ds.end();
            ds.close();
            bwr.flush();
        } catch (IOException e) {
            e.printStackTrace();
        }
        resultVars = null;
        qExec = null;
        resultSet = null;
        ds = null;
    } else {
        retVal = false;
    }
    return retVal;
}
I set the variables to null at the end because I wasn't sure whether there was a leak there.
However, after the 22nd file the program fails with the following message:
java.lang.OutOfMemoryError: GC overhead limit exceeded
at org.apache.jena.ext.com.google.common.cache.LocalCache$EntryFactory$2.newEntry(LocalCache.java:455)
at org.apache.jena.ext.com.google.common.cache.LocalCache$Segment.newEntry(LocalCache.java:2144)
at org.apache.jena.ext.com.google.common.cache.LocalCache$Segment.put(LocalCache.java:3010)
at org.apache.jena.ext.com.google.common.cache.LocalCache.put(LocalCache.java:4365)
at org.apache.jena.ext.com.google.common.cache.LocalCache$LocalManualCache.put(LocalCache.java:5077)
at org.apache.jena.atlas.lib.cache.CacheGuava.put(CacheGuava.java:76)
at org.apache.jena.tdb.store.nodetable.NodeTableCache.cacheUpdate(NodeTableCache.java:205)
at org.apache.jena.tdb.store.nodetable.NodeTableCache._retrieveNodeByNodeId(NodeTableCache.java:129)
at org.apache.jena.tdb.store.nodetable.NodeTableCache.getNodeForNodeId(NodeTableCache.java:82)
at org.apache.jena.tdb.store.nodetable.NodeTableWrapper.getNodeForNodeId(NodeTableWrapper.java:50)
at org.apache.jena.tdb.store.nodetable.NodeTableInline.getNodeForNodeId(NodeTableInline.java:67)
at org.apache.jena.tdb.store.nodetable.NodeTableWrapper.getNodeForNodeId(NodeTableWrapper.java:50)
at org.apache.jena.tdb.solver.BindingTDB.get1(BindingTDB.java:122)
at org.apache.jena.sparql.engine.binding.BindingBase.get(BindingBase.java:121)
at org.apache.jena.sparql.engine.binding.BindingProjectBase.get1(BindingProjectBase.java:52)
at org.apache.jena.sparql.engine.binding.BindingBase.get(BindingBase.java:121)
at org.apache.jena.sparql.engine.binding.BindingProjectBase.get1(BindingProjectBase.java:52)
at org.apache.jena.sparql.engine.binding.BindingBase.get(BindingBase.java:121)
at org.apache.jena.sparql.engine.binding.BindingBase.hashCode(BindingBase.java:201)
at org.apache.jena.sparql.engine.binding.BindingBase.hashCode(BindingBase.java:183)
at java.util.HashMap.hash(HashMap.java:338)
at java.util.HashMap.containsKey(HashMap.java:595)
at java.util.HashSet.contains(HashSet.java:203)
at org.apache.jena.sparql.engine.iterator.QueryIterDistinct.getInputNextUnseen(QueryIterDistinct.java:106)
at org.apache.jena.sparql.engine.iterator.QueryIterDistinct.hasNextBinding(QueryIterDistinct.java:70)
at org.apache.jena.sparql.engine.iterator.QueryIteratorBase.hasNext(QueryIteratorBase.java:114)
at org.apache.jena.sparql.engine.iterator.QueryIterSlice.hasNextBinding(QueryIterSlice.java:76)
at org.apache.jena.sparql.engine.iterator.QueryIteratorBase.hasNext(QueryIteratorBase.java:114)
at org.apache.jena.sparql.engine.iterator.QueryIteratorWrapper.hasNextBinding(QueryIteratorWrapper.java:39)
at org.apache.jena.sparql.engine.iterator.QueryIteratorBase.hasNext(QueryIteratorBase.java:114)
at org.apache.jena.sparql.engine.iterator.QueryIteratorWrapper.hasNextBinding(QueryIteratorWrapper.java:39)
at org.apache.jena.sparql.engine.iterator.QueryIteratorBase.hasNext(QueryIteratorBase.java:114)
Disconnected from the target VM, address: '127.0.0.1:57723', transport: 'socket'
Process finished with exit code 255
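The trace goes through QueryIterSlice, then QueryIterDistinct, which checks a HashSet. To check my reading of that, here is a minimal sketch in plain Java of what a DISTINCT followed by a LIMIT/OFFSET slice has to do over a stream of rows. It is a model under my own assumptions, not Jena's code: the class `DistinctSliceModel` and the method `seenSetSizeForPage` are hypothetical, rows are modelled as `Long` values, and the page size is scaled down.

```java
import java.util.HashSet;
import java.util.Iterator;
import java.util.Set;
import java.util.stream.LongStream;

public class DistinctSliceModel {
    /**
     * Hypothetical model: drops duplicates with a HashSet, skips `offset`
     * distinct rows, then consumes `limit` distinct rows. Returns how many
     * rows the "seen" set holds once the page is complete.
     */
    public static int seenSetSizeForPage(Iterator<Long> rows, long offset, long limit) {
        Set<Long> seen = new HashSet<>();
        long distinctSoFar = 0;
        while (rows.hasNext() && distinctSoFar < offset + limit) {
            Long row = rows.next();
            if (seen.add(row)) {   // true only for a row not seen before
                distinctSoFar++;   // rows below `offset` are counted, then discarded
            }
        }
        return seen.size();
    }

    public static void main(String[] args) {
        int limit = 1000; // scaled down from 100000 so the sketch runs quickly
        for (int page : new int[] {0, 10, 22}) {
            Iterator<Long> rows = LongStream.range(0, 1_000_000).iterator();
            long offset = (long) limit * page;
            System.out.println("page " + page + " -> seen set size "
                    + seenSetSizeForPage(rows, offset, limit));
        }
        // page 0 -> seen set size 1000
        // page 10 -> seen set size 11000
        // page 22 -> seen set size 23000
    }
}
```

In this model every page restarts from the first row, and the set has to hold all distinct rows up to OFFSET + LIMIT before the page is finished. If the real iterators behave similarly, that would match the slowdown and the memory growth I see on later pages, but I have not confirmed this against the Jena source.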
The memory viewer shows memory usage increasing after each page is queried.
It seems clear that the Jena LocalCache is filling up. I changed Xmx to 2048m and Xms to 512m, but the result was the same.
Do I need more memory?
Do I need to clear something?
Do I need to stop the program and do it in parts?
Is my query wrong?
Does the OFFSET have anything to do with it?
I read in some old mailing list postings that you can turn the cache off, but I could not find any way to do it. Is there a way to turn the cache off?
I know it is a very difficult question but I appreciate any help.