
I'm trying to scrape the Infobox from most automobile pages, but I'm getting the syntax wrong. From other helpful SO posts, I've found a handy way to fetch a standard Infobox template (the example given was for hydrogen):

https://en.m.wikipedia.org/w/index.php?action=raw&title=Template:Infobox%20hydrogen

I can use a similar process to pull the Ford Pinto page (I'm using this page because it has only a single Infobox; there was only one, infamous, model generation):

https://en.m.wikipedia.org/w/index.php?action=raw&title=Ford_Pinto

This page, like most automobile pages, uses one of the vehicle-specific Infobox templates, in this case "Infobox automobile" (sorry for the massive block; I'm on mobile and will edit this after posting, once I've read up on SO formatting):

{{Infobox automobile
| name = Ford Pinto
| image = Ford Pinto.jpg
| caption = Ford Pinto
| manufacturer = [[Ford Motor Company|Ford]]
| aka = Mercury Bobcat
| production = September 1970–1980
| model_years = 1971–1980 (Pinto)<br> 1974–1980 (Bobcat)
| assembly = '''United States:''' {{ubl|[[Edison, New Jersey]] ([[Edison Assembly]])|[[Milpitas, California]] ([[San Jose Assembly Plant|San Jose Assembly]])}}'''Canada:''' {{ubl|[[Southwold, Ontario]] ([[St. Thomas Assembly]])}}
| designer = Robert Eidschun (1968)<ref name=bbw20091030>
...
Snipped some useless stuff
...
</ref>
| class = [[Subcompact car]]
| body_style = 2-door [[Sedan (automobile)|sedan]]<br/>2-door [[sedan delivery]]<br/>2-door [[station wagon]]<br/> 3-door [[hatchback]]
| related = [[Ford Pinto#Mercury Bobcat (1974–1980)|Mercury Bobcat]]<br>[[Ford Mustang (second generation)|Ford Mustang II]]<br> [[Pangra]]
| layout = [[Front-engine, rear-wheel-drive layout|FR layout]]
| engine = {{unbulleted list
  | 1.6L ''[[Ford Kent engine|Kent]]'' I4
  | 2.0L ''[[Ford Pinto engine|EAO]]'' I4
  | 2.3L ''[[Ford Pinto engine|OHC]]'' I4
  | 2.8L ''[[Ford Cologne engine|Cologne]]'' V6
  }}
| transmission = {{unbulleted list
  | 4-speed manual
  | 3-speed ''[[Ford C3 transmission|C3/"Selectshift/Cruise-O-Matic"]]'' automatic
...
Snipped
...
</ref>
  }}
| wheelbase = {{convert|94.0|in|mm|abbr=on}}<ref>
...
Snipped
...
</ref>
| length = {{convert|163|in|mm|abbr=on}}
| width = {{convert|69.4|in|mm|abbr=on}}
| height = {{convert|50|in|mm|abbr=on}}
| weight = {{convert|2015|–|2270|lb|abbr=on}} (1971)
| predecessor = [[Ford Cortina|Ford Cortina (captive import)]]
| successor = [[Ford Escort (North America)|Ford Escort]]
}}
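
A minimal sketch of how a template call like the one above might be cut out of the raw page text and split into fields, assuming Python and only the standard library. The helper names `extract_template` and `split_params` are made up for illustration, and the brace matching is deliberately naive: it does not account for `<nowiki>`, HTML comments, or triple-brace parameters.

```python
import re
from typing import Dict, Optional


def extract_template(wikitext: str, name: str) -> Optional[str]:
    """Return the first {{name ...}} call, nested templates included, or None."""
    match = re.search(r"\{\{\s*" + re.escape(name), wikitext, re.IGNORECASE)
    if match is None:
        return None
    start = match.start()
    depth = 0
    i = start
    while i < len(wikitext) - 1:
        pair = wikitext[i:i + 2]
        if pair == "{{":
            depth += 1
            i += 2
        elif pair == "}}":
            depth -= 1
            i += 2
            if depth == 0:  # the opening braces at `start` are now closed
                return wikitext[start:i]
        else:
            i += 1
    return None  # unbalanced braces


def split_params(template: str) -> Dict[str, str]:
    """Split a template call on top-level pipes into {parameter: raw value}."""
    body = template[2:-2]  # drop the outer {{ and }}
    parts, buf, depth = [], [], 0
    i = 0
    while i < len(body):
        pair = body[i:i + 2]
        if pair in ("{{", "[["):
            depth += 1
            buf.append(pair)
            i += 2
        elif pair in ("}}", "]]"):
            depth -= 1
            buf.append(pair)
            i += 2
        elif body[i] == "|" and depth == 0:
            # a pipe outside any nested template or link starts a new parameter
            parts.append("".join(buf))
            buf = []
            i += 1
        else:
            buf.append(body[i])
            i += 1
    parts.append("".join(buf))

    params = {}
    for part in parts[1:]:  # parts[0] is the template name
        key, sep, value = part.partition("=")
        if sep:
            params[key.strip()] = value.strip()
    return params


# Shortened, hand-made sample in the same shape as the Pinto infobox.
sample = """Intro text.
{{Infobox automobile
| name = Ford Pinto
| manufacturer = [[Ford Motor Company|Ford]]
| length = {{convert|163|in|mm|abbr=on}}
| engine = {{unbulleted list
  | 1.6L I4
  | 2.0L I4
  }}
}}
More text."""

infobox = extract_template(sample, "Infobox automobile")
fields = split_params(infobox)
print(fields["name"], "/", fields["length"])
```

The values come back as raw wikitext, so links, `{{convert}}` calls, and `<ref>` tags would still need their own cleanup.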

Though the result is not as pretty as the raw wikitext above, another alternative is to use the REST API to slim the page down to just the article, in HTML, which lets me use a standard HTML parser to pull just the Infobox HTML table (the link should work in Chrome, and will definitely work on an Android device):

view-source:https://en.wikipedia.org/api/rest_v1/page/html/Ford_Pinto

<table class="infobox hproduct" style="width:22em" about="#mwt7" typeof="mw:Transclusion" data-mw='{"parts":[{"template":{"target":{"wt":"Infobox automobile\n","href":"./Template:Infobox_automobile"}

I can handle parsing either of these for the information I want, namely performance information (model, years, drivetrain layout, engines, transmissions, wheelbase, weight). However, despite trying various API and other URLs, I have so far been unable to fetch just the Infobox using the API alone. I'm also not sure what the difference is between an api.php?action=parse URL and an index.php?action=raw URL; any clarification is welcome, though I don't think it's directly relevant. Here are some unsuccessful examples of what I have tried, each with different errors/results:

https://en.wikipedia.org/w/ <- append the following to this base link as I can't post a bunch of links

api.php?action=parse&page=Template:Infobox%20automobile%20Ford_Pinto&format=json

api.php?action=query&titles=Template:Infobox%20automobile%20Ford%20Pinto&prop=revisions&rvprop=content&format=json&formatversion=2

api.php?action=query&titles=Template:Infobox%20automobile%20Ford_Pinto&prop=revisions&rvprop=content&format=json&formatversion=2

index.php?action=raw&title=Template:Infobox%20automobile%20Ford_Pinto

index.php?action=raw&title=Template:Infobox%20automobile%20Ford%20Pinto

index.php?action=raw&titles=Template:Infobox%20automobile%20Ford_Pinto

index.php?action=raw&titles=Template:Infobox%20automobile%20Ford%20Pinto

This is different from various other Infobox scraping questions as these articles use a specific Infobox template that prevents me from using the very successful API url I've posted above, although I'm sure this is user error and a simple fix. Thank you for your time in reading and assisting!

Edit: the suggested page is the way I'm already trying, and failing. Per that page, I am attempting the 'wrong' way until someone, including myself, figures out what I'm doing wrong - assuming there is a right way for the non standardized/base Infobox Templates. Failing any new information in a day or so, I'll just accept the currently suggested answer to reward that user's helpfulness - but I really hoped I'd get a few more attempts, which is why I created an account and asked the hive mind after searching and failing to find an answer from the many other questions I checked. By the way, any attention is good attention, so thank you kindly for taking the time to look over this!

JesseChaos
  • I've found an alternative, which will let me pare the page down to just the Infobox, but in ugly html. Updating my question with my partial answer (it's better than parsing the entire page but still not ideal, the hydrogen example is ideal and I'm sure it can be done if I could figure out what is wrong with my syntax). – JesseChaos Apr 21 '18 at 01:26
  • Possible duplicate of [How to extract information from a Wikipedia infobox?](https://stackoverflow.com/questions/33862336/how-to-extract-information-from-a-wikipedia-infobox) – Tgr Apr 21 '18 at 20:38
  • This is different as I've shown in my examples, I am trying the "correct way" per that very post and failing. You can see the API links I've used in my post, and their similarity to the suggested example in that post: wikidata.org/w/api.php?action=wbgetentities&sites=enwiki&titles=Albert_Einstein vs a more specific API call specifically for the Infobox as I have above: (working example) https://en.m.wikipedia.org/w/index.php?action=raw&title=Template:Infobox%20hydrogen (failure example) https://en.wikipedia.org/w/index.php?action=raw&title=Template:Infobox%20automobile%20Ford_Pinto – JesseChaos Apr 21 '18 at 23:03
  • Those links are not similar at all. You can see how the infobox is used at https://en.wikipedia.org/wiki/Ford_Pinto?action=raw but parsing it is not that easy - the reasons and better alternatives are all explained in the linked question. – Tgr Apr 27 '18 at 14:33

2 Answers


Wikipedia infoboxes are not designed to be scrapeable. They are primarily a (front-end!) templating mechanism; the fact that they sometimes contain structured data is incidental.

The difference between action=raw and action=parse is simply that raw gives you the original wikitext (like you'd see if you clicked the "Edit" link on the article), and parse gives you the rendered HTML. Neither of these is likely to be much use for your purposes.
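
To make the distinction concrete, here is a small Python sketch that builds both kinds of URL for a page title. The function names are illustrative only; `action=raw` with `title` and `action=parse` with `page` and `format=json` match the URLs in the question, and `prop=text` is the standard `action=parse` option that asks for the rendered HTML. Either way, the title has to name a real page: `Template:Infobox hydrogen` exists as a page of its own, whereas the Pinto's infobox only exists inside the text of `Ford_Pinto`.

```python
from urllib.parse import urlencode

BASE = "https://en.wikipedia.org/w/"


def raw_url(title: str) -> str:
    """index.php?action=raw: the page's original wikitext, as in the editor."""
    return BASE + "index.php?" + urlencode({"action": "raw", "title": title})


def parse_url(page: str, prop: str = "text") -> str:
    """api.php?action=parse: the page run through the parser, wrapped in JSON."""
    return BASE + "api.php?" + urlencode(
        {"action": "parse", "page": page, "prop": prop, "format": "json"}
    )


pinto_raw = raw_url("Ford_Pinto")
pinto_parsed = parse_url("Ford_Pinto")
print(pinto_raw)
print(pinto_parsed)
```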

Your best bet will be to use data from a downstream project like DBpedia which has already done the dirty work of parsing these articles. For instance, here's their parsed data for the Ford Pinto:

http://dbpedia.org/snorql/?describe=http%3A//dbpedia.org/resource/Ford_Pinto

duskwuff -inactive-
  • Thank you for your reply. It's good to know that there is a reason for it being so difficult. I strongly believe that it's possible to at least pull only the Infobox template using the API - if not I'll just grep for the "Infobox Automobile" and go from there. I have looked at dbpedia and others and they're even more of a mess :/ – JesseChaos Apr 20 '18 at 23:16
  • @JesseChaos The API isn't going to help you. As far as MediaWiki is concerned, the infobox template is part of the article text, and it gets turned into HTML just like the rest of the article text. – duskwuff -inactive- Apr 21 '18 at 00:33
  • But the hydrogen example looks so good! It works exactly how I want it to. I know I must be doing something wrong with that url. I guess I'll get started on grepping - I'll have to do a bit of that anyway. Thanks, @duskwuff ! – JesseChaos Apr 21 '18 at 01:11
  • @JesseChaos The Wikipedia pages on chemical elements are written by different people than the articles about cars. As such, they're structured differently. (IIRC, some of the chemical infoboxes might be automatically created from external databases.) – duskwuff -inactive- Apr 21 '18 at 01:14
  • That's fair. I found another alternative, so now I'll A/B test the two approaches to see whether the scraper libs I have are more effective than some quick grepping of the JSON. Still holding out hope that there's some change that could be made to the URL to get exactly what I want. Thanks for all the help, and for the additional information - it's interesting how much of a superbly massive mess Wikipedia is behind the scenes. Truly eye opening! Feel free to further inform me - if I get nothing else in a day or two I'll select your answer so you get the rep points or however this karma system works @duskwuff – JesseChaos Apr 21 '18 at 01:37

You could use the Parsoid output of the page, https://en.wikipedia.org/api/rest_v1/page/html/Ford_Pinto, which contains a table with class infobox and a data-mw attribute holding JSON-encoded data. I would find the infoboxes (CSS selector table.infobox), run JSON.parse on each data-mw value, check that the template target's href is ./Template:Infobox_automobile, and then read the parameter you need from the params field.

Here's a sample request:

curl -X GET --header 'Accept: text/html; charset=utf-8; profile="https://www.mediawiki.org/wiki/Specs/HTML/1.6.1"' 'https://en.wikipedia.org/api/rest_v1/page/html/Ford_Pinto'

In addition to the Accept header it's also good to set the User-Agent header to a unique string.
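
A rough Python equivalent of the steps above, using only the standard library (`json.loads` stands in for `JSON.parse`). The class and function names, the User-Agent string, and the sample markup are placeholders made up for this sketch; the `parts` / `template` / `target` / `params` layout follows the data-mw snippet shown in the question, and each parameter is assumed to carry its wikitext under a `wt` key.

```python
import json
import urllib.parse
import urllib.request
from html.parser import HTMLParser
from typing import Optional

REST_HTML = "https://en.wikipedia.org/api/rest_v1/page/html/"


def fetch_page_html(title: str) -> str:
    """Download the Parsoid HTML for a page (network call; not run below)."""
    request = urllib.request.Request(
        REST_HTML + urllib.parse.quote(title, safe=""),
        # Placeholder User-Agent: replace with a string that identifies you.
        headers={"User-Agent": "infobox-sketch/0.1 (you@example.org)"},
    )
    with urllib.request.urlopen(request) as response:
        return response.read().decode("utf-8")


class InfoboxCollector(HTMLParser):
    """Collect the decoded data-mw JSON of every <table class="infobox ...">."""

    def __init__(self) -> None:
        super().__init__()
        self.data_mw = []

    def handle_starttag(self, tag, attrs):
        attributes = dict(attrs)
        classes = (attributes.get("class") or "").split()
        if tag == "table" and "infobox" in classes and attributes.get("data-mw"):
            self.data_mw.append(json.loads(attributes["data-mw"]))


def infobox_params(
    html_text: str, href: str = "./Template:Infobox_automobile"
) -> Optional[dict]:
    """Return {parameter: wikitext} for the first infobox built from `href`."""
    collector = InfoboxCollector()
    collector.feed(html_text)
    for data in collector.data_mw:
        for part in data.get("parts", []):
            if not isinstance(part, dict):
                continue  # parts may also contain plain strings
            template = part.get("template", {})
            if template.get("target", {}).get("href") == href:
                return {
                    name: value.get("wt", "")
                    for name, value in template.get("params", {}).items()
                }
    return None


# Hand-made sample in the shape of the data-mw attribute from the question.
sample = (
    '<table class="infobox hproduct" typeof="mw:Transclusion" data-mw=\''
    '{"parts":[{"template":{"target":{"wt":"Infobox automobile\\n",'
    '"href":"./Template:Infobox_automobile"},"params":{"name":{"wt":"Ford Pinto"},'
    '"layout":{"wt":"[[Front-engine, rear-wheel-drive layout|FR layout]]"}},"i":0}}]}'
    '\'></table>'
)
params = infobox_params(sample)
print(params["name"])
```

For a live page, `infobox_params(fetch_page_html("Ford_Pinto"))` would be the intended call. The parameter values are still raw wikitext, so links and nested templates would need further cleanup.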

Bernd S