I'm new to querying DBPedia. How can I get all companies from http://dbpedia.org/sparql?

This query returns only 50,000 companies:

SELECT DISTINCT * WHERE {?company a dbpedia-owl:Company}
Anton

2 Answers

You're right that your query isn't returning all the companies. The pattern is correct, though. Notice that this query, which only counts the companies, returns 88054:

prefix dbpedia-owl: <http://dbpedia.org/ontology/>

select (count(distinct ?company) as ?count)
where {
  ?company a dbpedia-owl:Company
}

I think this is a limit imposed by the DBpedia SPARQL endpoint for performance reasons. One thing that you could do is download the data and run your query locally, but that's probably a bit more work than you want. Instead, you can order the results (it doesn't really matter how, so long as you always do it the same way) and use limit and offset to select within those results. For instance:

prefix dbpedia-owl: <http://dbpedia.org/ontology/>

select ?company
where {
  ?company a dbpedia-owl:Company
}
order by ?company
limit 10

prefix dbpedia-owl: <http://dbpedia.org/ontology/>

select ?company
where {
  ?company a dbpedia-owl:Company
}
order by ?company
limit 10
offset 5823
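
To walk the whole result set this way, you repeat the same ordered query with increasing offsets. Here's a rough Python sketch that only generates those paged queries. The helper names `page_query` and `page_offsets` and the page size of 10000 are my own choices for illustration, and 88054 is just the count reported by the counting query above.

```python
from math import ceil

PREFIX = "prefix dbpedia-owl: <http://dbpedia.org/ontology/>"


def page_query(offset, limit):
    """Return the ordered company query for one page of results."""
    return (
        f"{PREFIX}\n"
        "select ?company\n"
        "where {\n"
        "  ?company a dbpedia-owl:Company\n"
        "}\n"
        "order by ?company\n"
        f"limit {limit}\n"
        f"offset {offset}"
    )


def page_offsets(total, page_size):
    """Offsets needed to cover `total` rows in pages of `page_size`."""
    return [i * page_size for i in range(ceil(total / page_size))]


# 88054 is the count from the counting query; 10000 is an arbitrary page size.
offsets = page_offsets(88054, 10000)
queries = [page_query(o, 10000) for o in offsets]
```

Each string in `queries` can then be sent to the endpoint in turn, in order.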

This is the general approach. However, it still has a problem on DBpedia because of a hard limit of 40000 sorted rows. There's a documentation article which mentions this:

Working with constraints DBpedia's SPARQL endpoint MaxSortedTopRows Limits via LIMIT & OFFSET

The DBpedia SPARQL endpoint is configured with the following INI setting:

MaxSortedTopRows = 40000

The setting above sets a threshold for sorted rows.

The proposed solution from that article is to use subqueries:

To prevent the problem outlined above you can leverage the use of subqueries which make better use of temporary storage associated with this kind of quest. An example would take the form:

SELECT ?p ?s 
WHERE 
  {
    {
      SELECT DISTINCT ?p ?s 
      FROM <http://dbpedia.org> 
      WHERE   
        { 
          ?s ?p <http://dbpedia.org/resource/Germany> 
        } ORDER BY ASC(?p) 
    }
  } 
OFFSET 50000 
LIMIT 1000

I'm not entirely sure why this solves the problem. Perhaps the endpoint can sort more than 40000 rows as long as it doesn't have to return them all. At any rate, it works. Your query would become:

prefix dbpedia-owl: <http://dbpedia.org/ontology/>

select ?company {{
  select ?company { 
    ?company a dbpedia-owl:Company
  }
  order by ?company
}} 
offset 88000
LIMIT 1000
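
If you want to script this, a loop along the following lines should work. It's only a sketch in Python and isn't tied to any particular SPARQL client: `run_query` is a hypothetical callable that you would supply, taking a query string and returning the list of company IRIs for that page. The names `subquery_page` and `fetch_all` and the default page size are also my own assumptions.

```python
PREFIX = "prefix dbpedia-owl: <http://dbpedia.org/ontology/>"


def subquery_page(offset, limit):
    """Sort inside a subquery, then apply offset/limit in the outer query."""
    return (
        f"{PREFIX}\n"
        "select ?company {{\n"
        "  select ?company {\n"
        "    ?company a dbpedia-owl:Company\n"
        "  }\n"
        "  order by ?company\n"
        "}}\n"
        f"offset {offset}\n"
        f"limit {limit}"
    )


def fetch_all(run_query, page_size=10000):
    """Collect pages until one comes back shorter than `page_size`.

    `run_query` is hypothetical: it takes a SPARQL string and returns
    a list of results for that single page.
    """
    results = []
    offset = 0
    while True:
        page = run_query(subquery_page(offset, page_size))
        results.extend(page)
        if len(page) < page_size:
            # A short (or empty) page means there is nothing left to fetch.
            break
        offset += page_size
    return results
```

Comparing `len(results)` with `len(set(results))` afterwards is a quick way to confirm that the pages don't overlap.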
Joshua Taylor
  • However, there is a problem with sorting a result set of more than 40000 records: `select ?company where { ?company a dbpedia-owl:Company } order by ?company limit 10 offset 40000` – Anton Jan 06 '14 at 11:22
  • Getting `Virtuoso 22023 Error SR353: Sorted TOP clause specifies more then 40010 rows to sort. Only 40000 are allowed. Either decrease the offset and/or row count or use a scrollable cursor` – Anton Jan 06 '14 at 11:22
  • @Anton Hmm… that does make things more complicated. I think at this point, you'd want to read [Working with constraints DBpedia's SPARQL endpoint MaxSortedTopRows Limits via LIMIT & OFFSET](http://virtuoso.openlinksw.com/dataspace/doc/dav/wiki/Main/VirtTipsAndTricksHowToHandleBandwidthLimitExceed), which addresses this particular issue. – Joshua Taylor Jan 06 '14 at 17:55
  • Thanks! Just wanted to post this link :) Going to give it a try. I wonder if such queries return non-intersecting sets. – Anton Jan 06 '14 at 18:01
  • @Anton, I just updated my answer to show how your query can be achieved using this technique. Looks like you'll be able to get all the companies. :) – Joshua Taylor Jan 06 '14 at 18:06
  • Query #1: `SELECT ?company WHERE { { SELECT DISTINCT ?company FROM <http://dbpedia.org> WHERE { ?company a dbpedia-owl:Company } ORDER BY ASC(?company) } } LIMIT 50000` and the same query with offset 50000 did the trick. Checking the uniqueness: `$ cat sparql sparql\(1\) | wc -l` gives 88056. To be sure: `$ cat sparql sparql\(1\) | sort | uniq | wc -l` gives 88055 – Anton Jan 06 '14 at 18:19
Another way to get all the companies from DBpedia is to run the following query with RDFSlice:

SELECT *
WHERE {
  { ?s a <http://dbpedia.org/ontology/Company> . ?s ?p ?o . }
  UNION
  { ?s1 a <http://dbpedia.org/ontology/Company> . ?o1 ?p1 ?s1 . }
}

This has the added advantage of giving you all the triples in which each matched resource appears as subject or object. It takes anywhere from a few minutes to several hours, depending on your RAM and CPU power.

paxRoman