I'm struggling with the execution of a SPARQL query in Jena, with a resulting behaviour that I don't understand...

I'm trying to query the ESCO ontology (https://ec.europa.eu/esco/download), and I'm using TDB to load the ontology and create the model (sorry if the terms I use are not accurate, I'm not very experienced).

My goal is to find the URI of a job position in the ontology whose label matches a term I have previously extracted. For example: extracted term "acuponcteur" -> label in ontology "Acuponcteur"@fr -> URI <http://ec.europa.eu/esco/occupation/14918>

What I call the "weird behaviour" is related to the results I'm getting (or not) when executing queries.

When executing the following query:

PREFIX skos: <http://www.w3.org/2004/02/skos/core#> 
PREFIX esco: <http://ec.europa.eu/esco/model#>      
PREFIX rdf: <http://www.w3.org/1999/02/22-rdf-syntax-ns#>   
SELECT ?position    
WHERE {     
    ?s rdf:type esco:Occupation. 
    { ?position skos:prefLabel ?label. } 
    UNION 
    { ?position skos:altLabel ?label. } 
    FILTER (lcase(?label) = "acuponcteur"@fr ) 
}
LIMIT 10 

I get these results after 1 minute:

-----------------------------------------------
| position                                    |
===============================================
| <http://ec.europa.eu/esco/occupation/14918> |
| <http://ec.europa.eu/esco/occupation/14918> |
| <http://ec.europa.eu/esco/occupation/14918> |
| <http://ec.europa.eu/esco/occupation/14918> |
| <http://ec.europa.eu/esco/occupation/14918> |
| <http://ec.europa.eu/esco/occupation/14918> |
| <http://ec.europa.eu/esco/occupation/14918> |
| <http://ec.europa.eu/esco/occupation/14918> |
| <http://ec.europa.eu/esco/occupation/14918> |
| <http://ec.europa.eu/esco/occupation/14918> |
-----------------------------------------------

However, when I add the DISTINCT keyword:

PREFIX skos: <http://www.w3.org/2004/02/skos/core#> 
PREFIX esco: <http://ec.europa.eu/esco/model#>      
PREFIX rdf: <http://www.w3.org/1999/02/22-rdf-syntax-ns#>   
SELECT DISTINCT ?position   
WHERE {     
    ?s rdf:type esco:Occupation. 
    { ?position skos:prefLabel ?label. } 
    UNION 
    { ?position skos:altLabel ?label. } 
    FILTER (lcase(?label) = "acuponcteur"@fr ) 
}
LIMIT 10 

the query seems to run forever (I stopped the execution after waiting 20 minutes).

I get the same behaviour when executing the first query (without DISTINCT) with a different label, one that I'm sure is not in the ontology. I expect an empty result, but the query seems to keep running and I have to kill it after a while (once again, I waited 20 minutes at most):

PREFIX skos: <http://www.w3.org/2004/02/skos/core#> 
PREFIX esco: <http://ec.europa.eu/esco/model#>      
PREFIX rdf: <http://www.w3.org/1999/02/22-rdf-syntax-ns#>   
SELECT ?position    
WHERE {     
    ?s rdf:type esco:Occupation. 
    { ?position skos:prefLabel ?label. } 
    UNION 
    { ?position skos:altLabel ?label. } 
    FILTER (lcase(?label) = "assistante scolaire"@fr ) 
}
LIMIT 10 

Could the problem be in the code I'm running? Here it is:

public static void main(String[] args) {

    // Make a TDB-backed dataset
    String directory = "data/testtdb" ;
    Dataset dataset = TDBFactory.createDataset(directory) ;

    // transaction (protects a TDB dataset against data corruption, unexpected process termination and system crashes)
    dataset.begin( ReadWrite.WRITE );
    // assume we want the default model, or we could get a named model here
    Model model = dataset.getDefaultModel();

    try {

          // read the input file - only needs to be done once
          String source = "data/esco.rdf";
          FileManager.get().readModel(model, source, "RDF/XML-ABBREV");

          // run a query

          String queryString =
                    "PREFIX skos: <http://www.w3.org/2004/02/skos/core#> " +
                    "PREFIX esco: <http://ec.europa.eu/esco/model#> " +     
                    "PREFIX rdf: <http://www.w3.org/1999/02/22-rdf-syntax-ns#> " +  
                    "SELECT ?position " +   
                    "WHERE { "  +   
                    "   ?s rdf:type esco:Occupation. " +
                    "   { ?position skos:prefLabel ?label. } " +
                    "   UNION " +
                    "   { ?position skos:altLabel ?label. }" +
                    "   FILTER (lcase(?label)= \"acuponcteur\"@fr ) " +
                    "}" +
                    "LIMIT 1 "  ;

          Query query = QueryFactory.create(queryString) ;

          // execute the query
          QueryExecution qexec = QueryExecutionFactory.create(query, model) ;
          try {
              ResultSet results = qexec.execSelect() ;
              // taken from apache Jena tutorial 
              ResultSetFormatter.out(System.out, results, query) ;

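          } finally { 
              qexec.close() ; 
          }

          // commit so the loaded data is persisted; end() without commit aborts a write transaction
          dataset.commit();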

      } finally {
          model.close() ;
          dataset.end();
      }

}

What am I doing wrong here? Any idea?

Thanks!

CecileR

1 Answer

As a first point that may or may not make much difference, you can use a property path to simplify

{ ?position skos:prefLabel ?label. } 
UNION 
{ ?position skos:altLabel ?label. } 

as

?position skos:prefLabel|skos:altLabel ?label 

This makes the query:

SELECT ?position    
WHERE {     
    ?s rdf:type esco:Occupation.                   # (1)
    ?position skos:prefLabel|skos:altLabel ?label  # (2)
    FILTER (lcase(?label)="acuponcteur"@fr ) 
}

What's the point of ?s in this query? Some number n of ?position/?label pairs match (2), and some number m of values of ?s match (1). Because the two patterns share no variable, the query returns m×n results, yet you never use the value of ?s. It looks like you reached for DISTINCT to get rid of the repeated values without checking why they were repeated in the first place. You should simply remove the useless line (1), which gives the query:

SELECT DISTINCT ?position    
WHERE {     
    ?position skos:prefLabel|skos:altLabel ?label
    FILTER (lcase(?label)="acuponcteur"@fr ) 
}

I wouldn't be surprised if, at that point, you don't even need the DISTINCT anymore.
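
To make the m×n point concrete, here is a small toy model in plain Java. It is only a sketch of what a join between two patterns with no shared variable produces, not Jena's actual evaluation code; the class name, the method name, and the ex:occ values are made up for illustration.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.LinkedHashSet;
import java.util.List;

public class CrossJoinDemo {

    // Toy model of "SELECT ?position" over two patterns that share no variable:
    // every binding of ?s is paired with every binding of ?position.
    static List<String> selectPosition(List<String> occupations, List<String> positions) {
        List<String> rows = new ArrayList<>();
        for (String s : occupations) {            // pattern (1): binds ?s only
            for (String position : positions) {   // pattern (2): binds ?position only
                rows.add(position);               // the projection drops ?s, leaving duplicates
            }
        }
        return rows;
    }

    public static void main(String[] args) {
        // m = 3 hypothetical occupations, n = 1 position whose label matches
        List<String> occupations = Arrays.asList("ex:occ1", "ex:occ2", "ex:occ3");
        List<String> positions = Arrays.asList("<http://ec.europa.eu/esco/occupation/14918>");

        List<String> rows = selectPosition(occupations, positions);
        System.out.println("rows: " + rows.size());                          // rows: 3
        System.out.println("distinct: " + new LinkedHashSet<>(rows).size()); // distinct: 1
    }
}
```

If the engine enumerates rows roughly this way, that would also be consistent with the timings in the question: without DISTINCT, LIMIT 10 can stop after ten duplicate rows, whereas with DISTINCT (or with a label that matches nothing) the limit is never reached and the whole product has to be scanned.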

Joshua Taylor
  • I'm so ashamed that my problem was actually this ?s error... ?s should be ?position, as I wanted to select only ?position URIs that are of rdf:type esco:Occupation. Thanks for the UNION simplification! As for the dataset reading and model creation, I actually put them in another class so that they don't run every time, but I simplified the code for the question. – CecileR Aug 14 '14 at 13:06
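
For reference, the fix described in the comment above (using ?position instead of ?s in the rdf:type pattern) combined with the property path from the answer could look roughly like the sketch below. The class OccupationQuery and its build method are hypothetical helpers, not part of Jena; they only assemble the query string, which would then be passed to QueryFactory.create as in the question. For real code, Jena's ParameterizedSparqlString may be a more robust way to inject the literal.

```java
import java.util.Locale;

public class OccupationQuery {

    // Hypothetical helper: builds the corrected query for one extracted term.
    static String build(String term, String lang) {
        // Lower-case the term so it can equal lcase(?label), and escape
        // backslashes and double quotes for the SPARQL string literal.
        String literal = term.toLowerCase(Locale.ROOT)
                             .replace("\\", "\\\\")
                             .replace("\"", "\\\"");
        return String.join("\n",
            "PREFIX skos: <http://www.w3.org/2004/02/skos/core#>",
            "PREFIX esco: <http://ec.europa.eu/esco/model#>",
            "PREFIX rdf: <http://www.w3.org/1999/02/22-rdf-syntax-ns#>",
            "SELECT DISTINCT ?position",
            "WHERE {",
            "    ?position rdf:type esco:Occupation .",            // same variable in both patterns
            "    ?position skos:prefLabel|skos:altLabel ?label .",
            "    FILTER (lcase(?label) = \"" + literal + "\"@" + lang + ")",
            "}",
            "LIMIT 10");
    }

    public static void main(String[] args) {
        System.out.println(build("Acuponcteur", "fr"));
    }
}
```

Because both triple patterns now share ?position, the type check restricts the label matches instead of multiplying them.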