
There is this fancy infobox in <some Wikipedia article>. How do I get the value of <this field and that>?

Tgr
  • Thanks for creating a question and an answer that make sense and I can point to from the hundreds of poorly formulated questions on the matter. :) – Nemo Jun 10 '16 at 06:09
  • 1
    Possible duplicate of [Getting the Infobox data from Wikipedia](http://stackoverflow.com/questions/3312346/getting-the-infobox-data-from-wikipedia) – Termininja Dec 09 '16 at 19:32

2 Answers


The wrong way: trying to parse HTML

Use (cURL/jQuery/file_get_contents/requests/wget/more jQuery) to fetch the HTML of the article, then use a DOM parser to extract table.infobox tr[3] td / use a regex.

This is actually a really bad idea most of the time. Wikipedia's HTML code is not particularly parsing-friendly (especially infoboxes which are a system of hand-written templates), the exact structure changes from infobox to infobox, and the structure of an infobox might change over time. You might also miss out on some features that would be otherwise available, such as internationalization.

The other wrong way: trying to parse wikitext

At a glance, the wikitext of some articles looks like it's a pretty straightforward representation of the infobox:

{{ Infobox Foo
| param1 = bar
| param2 = 123
...

In reality, that's not the case. Templates are "recursive" so you might run into stuff like param1 = {{convert|10|km|mi}}; template parameters might contain complex wikitext or HTML markup; some parameters might be missing from the article wikitext and fetched by the template from a subpage or other data repository. Just finding out where a parameter starts and ends might not be a simple business if it contains other templates which have their own parameters.
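To see why this bites in practice, here is a small Python sketch using the hypothetical Infobox Foo example above: a plain split on | tears the nested {{convert}} call apart, and even a "correct" split has to track template nesting depth by hand.

```python
# Hypothetical wikitext with a nested template, as in the example above.
wikitext = "{{Infobox Foo|param1 = {{convert|10|km|mi}}|param2 = 123}}"
body = wikitext[2:-2]  # strip the outer {{ }}

# Naive split: the nested {{convert|...}} is torn into pieces.
naive = body.split("|")
# -> ['Infobox Foo', 'param1 = {{convert', '10', 'km', 'mi}}', 'param2 = 123']

def split_params(body):
    """Split on '|' only at template nesting depth 0."""
    parts, depth, current, i = [], 0, "", 0
    while i < len(body):
        if body[i:i + 2] == "{{":
            depth += 1
            current += "{{"
            i += 2
        elif body[i:i + 2] == "}}":
            depth -= 1
            current += "}}"
            i += 2
        elif body[i] == "|" and depth == 0:
            parts.append(current.strip())
            current = ""
            i += 1
        else:
            current += body[i]
            i += 1
    parts.append(current.strip())
    return parts

print(split_params(body))
# -> ['Infobox Foo', 'param1 = {{convert|10|km|mi}}', 'param2 = 123']
```

Even this is only a sketch: it ignores {{{parameter}}} syntax, wikilinks containing |, HTML comments and more, which is exactly why a real parser is preferable.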

The ideal way: using a structured data source

There are various projects to provide the information contained in Wikipedia infoboxes in a structured form; the two large ones are Wikidata and DBpedia.

Wikidata is a project to build a knowledge base containing structured data; it is maintained by the same global movement that built Wikipedia, so information is in the process of being moved over. This is a manual process, so not all information in Wikipedia is available via Wikidata; on the other hand, there is a lot of information that's in Wikidata but not in Wikipedia. You can find the Wikidata page of an article and see what information it contains by following the Wikidata item link in the left-hand toolbar on the article page; programmatically, you can access Wikidata information using the wbgetentities API module (sandbox, explanation of concepts), e.g. wikidata.org/w/api.php?action=wbgetentities&sites=enwiki&titles=Albert_Einstein. There are also a SPARQL endpoint, database dumps, and clients in PHP, Java and Python.
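As a minimal Python sketch of the wbgetentities call above (the response dict here is hand-trimmed to a single claim for illustration; real responses are much larger; P569 is the "date of birth" property):

```python
import json
from urllib.parse import urlencode

# Build the same wbgetentities request as above, plus format=json.
params = {
    "action": "wbgetentities",
    "sites": "enwiki",
    "titles": "Albert_Einstein",
    "format": "json",
}
url = "https://www.wikidata.org/w/api.php?" + urlencode(params)
# import urllib.request
# data = json.load(urllib.request.urlopen(url))  # uncomment to hit the live API

# Trimmed, illustrative response for Q937 (Albert Einstein);
# the real response carries labels, sitelinks and many more claims.
data = {
    "entities": {
        "Q937": {
            "claims": {
                "P569": [{
                    "mainsnak": {
                        "datavalue": {"value": {"time": "+1879-03-14T00:00:00Z"}}
                    }
                }]
            }
        }
    }
}

entity = next(iter(data["entities"].values()))
birth = entity["claims"]["P569"][0]["mainsnak"]["datavalue"]["value"]["time"]
print(birth)  # +1879-03-14T00:00:00Z
```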

DBpedia is a project to harvest Wikipedia infobox information by automated means and publish it in a structured form. You can find the DBpedia page for a Wikipedia article by going to http://dbpedia.org/page/<Wikipedia article name>, e.g. http://dbpedia.org/page/Albert_Einstein. It offers many data formats, dumps, a SPARQL endpoint and various other things.
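Querying the DBpedia SPARQL endpoint is a plain HTTP GET; here is a hedged sketch (the dbo:birthDate property and the default dbo: prefix are how DBpedia currently models this, but check the endpoint's docs for your own properties):

```python
from urllib.parse import urlencode

# Ask DBpedia for Albert Einstein's birth date; the public endpoint
# predefines common prefixes such as dbo: (DBpedia ontology).
query = """
SELECT ?birthDate WHERE {
  <http://dbpedia.org/resource/Albert_Einstein> dbo:birthDate ?birthDate .
}
"""
params = {"query": query, "format": "application/sparql-results+json"}
url = "https://dbpedia.org/sparql?" + urlencode(params)

# import json, urllib.request
# results = json.load(urllib.request.urlopen(url))  # uncomment to run live
print(url[:60])
```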

The wrong ways done right

If the information you need is not available via Wikidata or DBpedia, there are still semi-structured ways of extracting data from infoboxes. For HTML-based extraction you can use Wikipedia's REST content API (e.g. https://en.wikipedia.org/api/rest_v1/page/html/Albert_Einstein) which returns a richer, more semantic HTML than the one used on normal article pages, and preserves in it some information about template structure.
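In that Parsoid-style HTML, transcluded content is marked with typeof="mw:Transclusion" and a data-mw attribute holding the template call as JSON. A stdlib-only Python sketch (the SAMPLE snippet below is hand-written in the style of Parsoid output, reusing the hypothetical Infobox Foo, not copied from a live page):

```python
import json
from html.parser import HTMLParser

# Hand-written fragment mimicking Parsoid output: the template name and
# parameters are recorded as JSON in the data-mw attribute.
SAMPLE = (
    "<table about=\"#mwt1\" typeof=\"mw:Transclusion\" data-mw='"
    '{"parts":[{"template":{"target":{"wt":"Infobox Foo"},'
    '"params":{"param1":{"wt":"bar"},"param2":{"wt":"123"}}}}]}'
    "'></table>"
)

class TransclusionFinder(HTMLParser):
    """Collect template calls from mw:Transclusion nodes."""
    def __init__(self):
        super().__init__()
        self.templates = []

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if attrs.get("typeof") == "mw:Transclusion" and "data-mw" in attrs:
            for part in json.loads(attrs["data-mw"]).get("parts", []):
                if isinstance(part, dict) and "template" in part:
                    self.templates.append(part["template"])

finder = TransclusionFinder()
finder.feed(SAMPLE)
tpl = finder.templates[0]
print(tpl["target"]["wt"])            # Infobox Foo
print(tpl["params"]["param1"]["wt"])  # bar
```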

Alternatively, you might start from wikitext and parse it into a syntax tree using the simpler, client-side mwparserfromhell Python module (docs) or the more powerful parsoid-jsapi which interacts with the Wikipedia REST content service.

A higher-level Python library which tries to extract infobox contents from wikitext is wptools.

Tgr
  • Can I use SPARQL with Wikidata or DBpedia? Which is better for getting Wikipedia data? – Alexan Sep 25 '16 at 00:34
  • 2
    @Alex depends on your use case. DBPedia tends to be more complete; Wikidata tends to be deeper and more semantic. – Tgr Apr 26 '17 at 12:01
  • I am not sure this is the ideal way, though: you'd get fewer results than by parsing it yourself. Parsing it yourself takes longer, but DBpedia often gives you a date while missing part of it; for example, there is no year in the date at http://dbpedia.org/page/Victory_Tests, while the actual Wikipedia page https://en.wikipedia.org/wiki/Victory_Tests has the year too. So it's more work, but manual parsing is better. – rob.m Aug 06 '17 at 15:06
  • Also remember that DBpedia is not synchronized in real time with Wikipedia; you may experience a delay of a few months between the Wikipedia version and the corresponding DBpedia entry. – ThomasFrancart Jul 30 '19 at 10:20
  • I bet about half the people getting to this answer are trying to parse Wikipedia with the intention to add the data to wikidata. – Matthias Winkelmann Sep 28 '20 at 19:13

The accepted answer is correct on all points, especially the subtext that parsing wikitext is horrible.

If, however, getting your data from Wikidata doesn't quite work for you, because (just hypothetically) you're the person trying to move data from WP to WD, I believe the format you are looking for is the parsetree. Here is what it looks like:

<...lots of other stuff omitted>
<template lineStart= "1">
   <title>Datatable TableRow</title>
   <part>
      <name>Picture         </name>
      <equals>=</equals>
      <value> Picture 2013-07-26.jpg</value>
   </part>
   <part>
      <name>Inscription    </name>
      <equals>=</equals>
      <value> This is an inscription visible on the image</value>
   </part>
   <part>
      <name>NS           </name>
      <equals>=</equals>
      <value> 54.0902049</value>
   </part>
   <part>
      <name>EW           </name>
      <equals>=</equals>
      <value> 12.1364164</value>
   </part>
   <part>
      <name>Region       </name>
      <equals>=</equals>
      <value> DE-MV</value>
   </part>
   <part>
      <name>Name         </name>
      <equals>=</equals>
      <value> Person, Anna</value>
   </part>
   <part>
      <name>Location          </name>
      <equals>=</equals>
      <value> Lange Stra\u00dfe&amp;nbsp;14&lt;br /&gt;&lt;small&gt;ex: Lange Stra\u00dfe&amp;nbsp;89&lt;/small&gt;</value>
   </part>
   <part>
      <name>Date </name>
      <equals>=</equals>
      <value> </value>
   </part>
</template>

Here's a URI to such a request with the MediaWiki API sandbox. Note the list of properties, which includes parsetree. I've included some other properties (including categories) just in case; you probably want to trim the list to what you actually need, to save your time and others' servers.
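Once you have the parsetree (action=parse with prop=parsetree), consuming it is straightforward XML work. A stdlib-only Python sketch, run against a hand-trimmed fragment mirroring the structure above:

```python
import xml.etree.ElementTree as ET

# Trimmed fragment shaped like the parsetree output shown above.
PARSETREE = """
<template lineStart="1">
  <title>Datatable TableRow</title>
  <part><name>NS           </name><equals>=</equals><value> 54.0902049</value></part>
  <part><name>EW           </name><equals>=</equals><value> 12.1364164</value></part>
</template>
"""

root = ET.fromstring(PARSETREE)
# Parameter names and values come back with the original padding, so strip them.
params = {
    part.findtext("name").strip(): part.findtext("value").strip()
    for part in root.iter("part")
}
print(params)  # {'NS': '54.0902049', 'EW': '12.1364164'}
```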

Matthias Winkelmann
  • The parse tree is definitely helpful but doesn't solve the issues that 1) an infobox parameter might be a template itself (unit conversion, date math, multiline formatting etc etc) 2) part of the information might come from elsewhere (e.g. city infoboxes often use demographic information stored on a different wiki page; in more horrible cases, the data is stashed into a huge Lua table) 3) the infobox might do some fairly complex manipulation of the arguments (e.g. many enwiki infoboxes generate a page description, but there isn't really a way to guess that from the raw wikitext). – Tgr Sep 30 '20 at 01:49
  • Granted, other methods probably won't be any better either. If you are looking to get the most information about the infobox, the Parsoid HTML is probably the richest, as it contains both template names and parameters and the rendered HTML, but it's probably trickier to use. – Tgr Sep 30 '20 at 01:51