
I am very new to DBpedia and I don't know where to start. From the research I did, what I understand is that the data can be accessed with the SPARQL query language (e.g. via Apache Jena). So I downloaded the .ttl files for the Ontology Infobox Properties. After extracting, the file is almost 2 GB, and this is where my problem started: none of my editors are able to open this file. My sample program to access this file is here:

import com.hp.hpl.jena.query.Query;
import com.hp.hpl.jena.query.QueryExecution;
import com.hp.hpl.jena.query.QueryExecutionFactory;
import com.hp.hpl.jena.query.QueryFactory;
import com.hp.hpl.jena.query.ResultSet;
import com.hp.hpl.jena.query.ResultSetFormatter;
import com.hp.hpl.jena.rdf.model.Model;
import com.hp.hpl.jena.util.FileManager;

public class OntologyExample {
    public static void main(String[] args) {
        FileManager.get().addLocatorClassLoader(
                OntologyExample.class.getClassLoader());
        Model model = FileManager
                .get()
                .loadModel("D:\\Dell XPS\\DBPEDIA\\instance_types_en.ttl\\instance_types_en.ttl");

        String q = "SELECT * WHERE { "
                + "?e <http://dbpedia.org/ontology/series> <http://dbpedia.org/resource/The_Sopranos> ."
                + "?e <http://dbpedia.org/ontology/releaseDate> ?date"
                + "?e <http://dbpedia.org/ontology/episodeNumber> ?number "
                + "?e <http://dbpedia.org/ontology/seasonNumber> ?season"
                + " }" + "ORDER BY DESC(?date)";

        Query query = QueryFactory.create(q);
        QueryExecution queryExecution = QueryExecutionFactory.create(query, model);
        ResultSet resultSet = queryExecution.execSelect();
        ResultSetFormatter.out(System.out, resultSet, query);
        queryExecution.close();
    }
}

So the input for this program is that 2 GB file. When I run it, it throws an exception like this:

Exception in thread "main" com.hp.hpl.jena.n3.turtle.TurtleParseException: GC overhead limit exceeded
at com.hp.hpl.jena.n3.turtle.ParserTurtle.parse(ParserTurtle.java:63)
at com.hp.hpl.jena.n3.turtle.TurtleReader.readWorker(TurtleReader.java:33)
at com.hp.hpl.jena.n3.JenaReaderBase.readImpl(JenaReaderBase.java:119)
at com.hp.hpl.jena.n3.JenaReaderBase.read(JenaReaderBase.java:84)
at com.hp.hpl.jena.rdf.model.impl.ModelCom.read(ModelCom.java:268)
at com.hp.hpl.jena.util.FileManager.readModelWorker(FileManager.java:403)
at com.hp.hpl.jena.util.FileManager.loadModelWorker(FileManager.java:306)
at com.hp.hpl.jena.util.FileManager.loadModel(FileManager.java:258)
at jena.tutorial.OntologyExample.main(OntologyExample.java:18)

Caused by: java.lang.OutOfMemoryError: GC overhead limit exceeded
at java.util.Arrays.copyOfRange(Unknown Source)
at java.lang.String.<init>(Unknown Source)
at org.apache.jena.iri.impl.LexerPath.yytext(LexerPath.java:420)
at org.apache.jena.iri.impl.AbsLexer.rule(AbsLexer.java:81)
at org.apache.jena.iri.impl.LexerPath.yylex(LexerPath.java:711)
at org.apache.jena.iri.impl.AbsLexer.analyse(AbsLexer.java:52)
at org.apache.jena.iri.impl.Parser.<init>(Parser.java:108)
at org.apache.jena.iri.impl.IRIImpl.<init>(IRIImpl.java:65)
at org.apache.jena.iri.impl.AbsIRIImpl.create(AbsIRIImpl.java:692)
at org.apache.jena.iri.IRI.resolve(IRI.java:432)
at com.hp.hpl.jena.n3.IRIResolver.resolve(IRIResolver.java:167)
at com.hp.hpl.jena.n3.turtle.ParserBase._resolveIRI(ParserBase.java:198)
at com.hp.hpl.jena.n3.turtle.ParserBase.resolveIRI(ParserBase.java:192)
at com.hp.hpl.jena.n3.turtle.ParserBase.resolveQuotedIRI(ParserBase.java:183)
at com.hp.hpl.jena.n3.turtle.parser.TurtleParser.IRI_REF(TurtleParser.java:737)
at com.hp.hpl.jena.n3.turtle.parser.TurtleParser.IRIref(TurtleParser.java:680)
at com.hp.hpl.jena.n3.turtle.parser.TurtleParser.GraphTerm(TurtleParser.java:496)
at com.hp.hpl.jena.n3.turtle.parser.TurtleParser.VarOrTerm(TurtleParser.java:420)
at com.hp.hpl.jena.n3.turtle.parser.TurtleParser.TriplesSameSubject(TurtleParser.java:150)
at com.hp.hpl.jena.n3.turtle.parser.TurtleParser.Statement(TurtleParser.java:97)
at com.hp.hpl.jena.n3.turtle.parser.TurtleParser.parse(TurtleParser.java:67)
at com.hp.hpl.jena.n3.turtle.ParserTurtle.parse(ParserTurtle.java:49)
... 8 more

I am running this code from Eclipse, and here are my eclipse.ini settings:

org.eclipse.epp.package.jee.product
--launcher.defaultAction
openFile
--launcher.XXMaxPermSize
512M
-showsplash
org.eclipse.platform
--launcher.XXMaxPermSize
512m
--launcher.defaultAction
openFile
-vmargs
-Dosgi.requiredJavaVersion=1.5
-Xms1024m
-Xmx2048m

So my problems here are:

  1. How can I access this kind of large file?
  2. How can I use DBpedia in a proper manner?

Please help me, I am stuck here. I am doing a project on DBpedia.

  • Did you check whether your query is syntactically correct? For example by trying it via the public SPARQL endpoint provided by DBpedia at http://dbpedia.org/sparql or http://dbpedia.org/snorql/ ? On top of that, it seems like you are trying to query the metadata and not the DBpedia data itself (which contains information about a TV series; the DBpedia data is much bigger than 2 GB, I believe). – Emre Sevinç Feb 21 '13 at 22:37

2 Answers


You can use Jena's ARQ to run SPARQL queries against DBpedia data, and if you are going to do lots of queries and data processing, it is useful to download the data and work with it locally. To do that, especially with data as large as DBpedia's, you probably shouldn't try to load it into an in-memory model; instead, use TDB and Fuseki to set up a SPARQL endpoint that you can run queries against. This has been discussed for a different dataset in this answer.
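As a rough sketch, the TDB-plus-Fuseki setup amounts to two commands. The store directory and the assumption that the Jena/Fuseki bin directories are on your PATH are mine; adjust them to your installation:

```shell
# One-time bulk load of the Turtle file into a persistent TDB store
# (assumed store location: /data/dbpedia-tdb)
tdbloader2 --loc /data/dbpedia-tdb instance_types_en.ttl

# Serve that store as a SPARQL endpoint, reachable at
# http://localhost:3030/ds/query
fuseki-server --loc=/data/dbpedia-tdb /ds
```

Because the data lives on disk in the TDB store, none of it has to fit in your JVM heap at query time.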

However, since you're just getting started, it's probably much easier to work with the public DBpedia SPARQL endpoint. There you can type in SPARQL queries and retrieve results in a variety of formats. The query in your question was a bit malformed, but was easy enough to clean up; the cleaned up and working query follows.

SELECT * WHERE {
    ?e <http://dbpedia.org/ontology/series> <http://dbpedia.org/resource/The_Sopranos>  .
    ?e <http://dbpedia.org/ontology/releaseDate> ?date .
    ?e <http://dbpedia.org/ontology/episodeNumber> ?number .
    ?e <http://dbpedia.org/ontology/seasonNumber> ?season .
}
ORDER BY DESC(?date)

SPARQL results

The DBpedia wiki actually has a whole page about accessing DBpedia online, Accessing the DBpedia Data Set over the Web, which will give you some idea of how you can access the data. Another page on the wiki, The DBpedia Data Set, will tell you much more about what data is available.
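If you want to use the public endpoint from Java rather than the web form, ARQ can send the query over HTTP with `QueryExecutionFactory.sparqlService`, so no data is loaded into your JVM at all. A minimal sketch (the endpoint URL is the standard public one; the class name is mine):

```java
import com.hp.hpl.jena.query.Query;
import com.hp.hpl.jena.query.QueryExecution;
import com.hp.hpl.jena.query.QueryExecutionFactory;
import com.hp.hpl.jena.query.QueryFactory;
import com.hp.hpl.jena.query.ResultSet;
import com.hp.hpl.jena.query.ResultSetFormatter;

public class RemoteQueryExample {
    public static void main(String[] args) {
        // Note the "." terminating each triple pattern.
        String q = "SELECT * WHERE { "
                + "?e <http://dbpedia.org/ontology/series> <http://dbpedia.org/resource/The_Sopranos> . "
                + "?e <http://dbpedia.org/ontology/releaseDate> ?date . "
                + "} ORDER BY DESC(?date)";
        Query query = QueryFactory.create(q);
        // sparqlService ships the query to the remote endpoint over HTTP;
        // no local model or TDB store is needed.
        QueryExecution qexec = QueryExecutionFactory.sparqlService(
                "http://dbpedia.org/sparql", query);
        try {
            ResultSet results = qexec.execSelect();
            ResultSetFormatter.out(System.out, results, query);
        } finally {
            qexec.close();
        }
    }
}
```

This needs network access and the ARQ jars on the classpath, but nothing else.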


I'm using TDB; it also comes with command-line tools that are easy to use and much faster than running everything from Eclipse. You can download the latest version from their downloads page. You can use tdbloader2 to load the .ttl file into the store and then query it with tdbquery, also on the command line, or from Eclipse just as you do now:

Dataset dataset = TDBFactory.createDataset( "path" );
Query query = QueryFactory.create( "SELECT * WHERE { "
        + "?e <http://dbpedia.org/ontology/series> <http://dbpedia.org/resource/The_Sopranos> . "
        + "?e <http://dbpedia.org/ontology/releaseDate> ?date . "
        + "?e <http://dbpedia.org/ontology/episodeNumber> ?number . "
        + "?e <http://dbpedia.org/ontology/seasonNumber> ?season "
        + "} ORDER BY DESC(?date)" );
QueryExecution qexec = QueryExecutionFactory.create( query, dataset );
ResultSet results = qexec.execSelect();
ResultSetFormatter.out( System.out, results, query );
qexec.close();

As far as I understand, you should put a . after every SPARQL triple pattern; only the one before the closing brace is optional. It could also be that the command-line tools run out of heap space. Just open the tdbloader2 or tdbquery script in a text editor and change the -Xmx value to the size you like, e.g.:

JVM_ARGS=${JVM_ARGS:--Xmx4096M}

Also make sure to set the JENAROOT as described in the link.
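Setting JENAROOT just means exporting an environment variable that points at your unpacked Jena/TDB directory, and putting its bin directory on the PATH so the scripts resolve. For example (the install path here is an assumption):

```shell
# Point JENAROOT at the unpacked Jena/TDB distribution (assumed path)
export JENAROOT=/opt/apache-jena
# Make tdbloader2, tdbquery, etc. available on the command line
export PATH="$PATH:$JENAROOT/bin"
```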
