For research purposes, I would like to parse some dumps from the French Wikipedia. Here is an extract of the XML file I want to parse:

 <mediawiki xmlns="http://www.mediawiki.org/xml/export-0.10/" xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" xsi:schemaLocation="http://www.mediawiki.org/xml/export-0.10/ http://www.mediawiki.org/xml/export-0.10.xsd" version="0.10" xml:lang="fr">
  <siteinfo>
    <sitename>Wikipédia</sitename>
    <dbname>frwiki</dbname>
    <base>https://fr.wikipedia.org/wiki/Wikip%C3%A9dia:Accueil_principal</base>
    <generator>MediaWiki 1.27.0-wmf.15</generator>
    <case>first-letter</case>
    <namespaces>
      <namespace key="-2" case="first-letter">Média</namespace>
      <namespace key="-1" case="first-letter">Spécial</namespace>
      <namespace key="0" case="first-letter" />
      <namespace key="1" case="first-letter">Discussion</namespace>
      <namespace key="2" case="first-letter">Utilisateur</namespace>
      <namespace key="3" case="first-letter">Discussion utilisateur</namespace>
      <namespace key="4" case="first-letter">Wikipédia</namespace>
      <namespace key="5" case="first-letter">Discussion Wikipédia</namespace>
      <namespace key="6" case="first-letter">Fichier</namespace>
      <namespace key="7" case="first-letter">Discussion fichier</namespace>
      <namespace key="8" case="first-letter">MediaWiki</namespace>
      <namespace key="9" case="first-letter">Discussion MediaWiki</namespace>
      <namespace key="10" case="first-letter">Modèle</namespace>
      <namespace key="11" case="first-letter">Discussion modèle</namespace>
      <namespace key="12" case="first-letter">Aide</namespace>
      <namespace key="13" case="first-letter">Discussion aide</namespace>
      <namespace key="14" case="first-letter">Catégorie</namespace>
      <namespace key="15" case="first-letter">Discussion catégorie</namespace>
      <namespace key="100" case="first-letter">Portail</namespace>
      <namespace key="101" case="first-letter">Discussion Portail</namespace>
      <namespace key="102" case="first-letter">Projet</namespace>
      <namespace key="103" case="first-letter">Discussion Projet</namespace>
      <namespace key="104" case="first-letter">Référence</namespace>
      <namespace key="105" case="first-letter">Discussion Référence</namespace>
      <namespace key="828" case="first-letter">Module</namespace>
      <namespace key="829" case="first-letter">Discussion module</namespace>
      <namespace key="2300" case="first-letter">Gadget</namespace>
      <namespace key="2301" case="first-letter">Gadget talk</namespace>
      <namespace key="2302" case="case-sensitive">Gadget definition</namespace>
      <namespace key="2303" case="case-sensitive">Gadget definition talk</namespace>
      <namespace key="2600" case="first-letter">Sujet</namespace>
    </namespaces>
  </siteinfo>
  <page>
    <title>Antoine Meillet</title>
    <ns>0</ns>
    <id>3</id>
    <revision>
      <id>123903866</id>
      <parentid>123513568</parentid>
      <timestamp>2016-03-02T15:50:55Z</timestamp>
      <contributor>
        <username>RobokoBot</username>
        <id>2090299</id>
      </contributor>
      <minor/>
      <comment>Ajout d'une puce avant {{Autorité}} suite à la modification du modèle</comment>
      <model>wikitext</model>
      <format>text/x-wiki</format>
      <text id="123933672" bytes="9683" />
      <sha1>f9e9rj6s5eistpyimc4xqrtauni5uc3</sha1>
    </revision>
  </page>
  <page>
    <title>Algèbre linéaire</title>
    <ns>0</ns>
    <id>7</id>
    <revision>
      <id>123705494</id>
      <parentid>121738150</parentid>
      <timestamp>2016-02-25T16:21:28Z</timestamp>
      <contributor>
        <username>Anareth</username>
        <id>2426186</id>
      </contributor>
      <minor/>
      <comment>/* Histoire */ grammaire</comment>
      <model>wikitext</model>
      <format>text/x-wiki</format>
      <text id="123731807" bytes="17107" />
      <sha1>iewjt56i5p1bhxup95b9bp08r5u0t9u</sha1>
    </revision>
  </page>

But when I try to parse it with the following code:

library(XML)
data <- xmlParse("test.xml")

I get the following error:

Error: 1: Extra content at the end of the document

I also tried the xml2 package, with the same result.

Do you have a solution?

Thanks in advance.

Léo Joubert
  • See this [other answer](http://stackoverflow.com/a/16972780/2641825): "XML can only have one 'document entity' or 'root'". You have one `siteinfo` and three `page` elements in your XML without a single higher root. – Paul Rougieux Mar 25 '16 at 08:59
  • Check out the `WikipediR` package, which wraps the Wikipedia API; it can clean up the results a bit for you, or give you HTML if you want to parse with `rvest` or whatever your favorite scraping/parsing package is. – alistaire Mar 25 '16 at 10:10

2 Answers

Disclaimer: XML is not my field of work and the code below may not be the recommended way to do it. But at least it works.

Add a new tag such as <wikidump> at the beginning of the XML file and </wikidump> at the end, so that everything sits under a single root. For example, the beginning of your file becomes:

<wikidump>
<mediawiki xmlns="http://www.mediawiki.org/xml/export-0.10/" xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" xsi:schemaLocation="http://www.mediawiki.org/xml/export-0.10/ http://www.mediawiki.org/xml/export-0.10.xsd" version="0.10" xml:lang="fr" />
  <siteinfo>
    <sitename>Wikipédia</sitename>
....

And the end looks like:

      <sha1>9smdmhiguf5lxyrfqxq2j6m66tq65y9</sha1>
    </revision>
  </page>
</wikidump>

This code can then read the XML file without a problem:

library(XML)
wikidata <- xmlParse("test.xml")

If you want, you could add those two <wikidump> lines programmatically with:

# open a new file test2.xml for writing
xmlfile <- file("test2.xml","w")
writeLines("<wikidump>", xmlfile)
# add content of test.xml
writeLines(readLines("test.xml"), xmlfile)
writeLines("</wikidump>", xmlfile)
close(xmlfile)
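
As an alternative to writing a second file, the same wrapping can probably be done in memory. This is only a sketch: the `fragment` vector below is a made-up stand-in for `readLines("test.xml")`, and `wikidump` is the same arbitrary root name as above.

```r
library(XML)

# Hypothetical stand-in for readLines("test.xml"): a self-closed <mediawiki/>
# followed by sibling <page> elements, i.e. no single root element.
fragment <- c(
  '<mediawiki version="0.10" xml:lang="fr" />',
  '<page><title>Antoine Meillet</title><id>3</id></page>',
  '<page><title>Another page</title><id>7</id></page>'
)

# Wrap the lines in an artificial root and collapse them into one string
wrapped <- paste(c("<wikidump>", fragment, "</wikidump>"), collapse = "\n")

# asText = TRUE tells xmlParse that the argument is XML content, not a file name
doc <- xmlParse(wrapped, asText = TRUE)

# Check that the pages are reachable under the new root
titles <- xpathSApply(doc, "//page/title", xmlValue)
print(titles)
```

With the real file, `fragment` would be replaced by `readLines("test.xml")`. This reads the whole file into memory, so it is only practical for small extracts.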
Paul Rougieux

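Use htmlTreeParse like this. The HTML parser is more lenient than the XML parser and accepts input that has no single root element. (For reproducibility, the input is defined as Lines in the note at the end.)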

library(XML)
doc <- htmlTreeParse(Lines, asText = TRUE, useInternalNodes = TRUE)

# test it by extracting all contributors
xpathSApply(xmlRoot(doc), "//contributor", xmlValue)
## [1] "RobokoBot2090299" "Anareth2426186"   "Prospaire2133855"

Note: This input was used:

Lines <- '<mediawiki xmlns="http://www.mediawiki.org/xml/export-0.10/" xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" xsi:schemaLocation="http://www.mediawiki.org/xml/export-0.10/ http://www.mediawiki.org/xml/export-0.10.xsd" version="0.10" xml:lang="fr" />
  <siteinfo>
    <sitename>Wikipédia</sitename>
    <dbname>frwiki</dbname>
    <base>https://fr.wikipedia.org/wiki/Wikip%C3%A9dia:Accueil_principal</base>
    <generator>MediaWiki 1.27.0-wmf.15</generator>
    <case>first-letter</case>
    <namespaces>
      <namespace key="-2" case="first-letter">Média</namespace>
      <namespace key="-1" case="first-letter">Spécial</namespace>
      <namespace key="0" case="first-letter" />
      <namespace key="1" case="first-letter">Discussion</namespace>
      <namespace key="2" case="first-letter">Utilisateur</namespace>
      <namespace key="3" case="first-letter">Discussion utilisateur</namespace>
      <namespace key="4" case="first-letter">Wikipédia</namespace>
      <namespace key="5" case="first-letter">Discussion Wikipédia</namespace>
      <namespace key="6" case="first-letter">Fichier</namespace>
      <namespace key="7" case="first-letter">Discussion fichier</namespace>
      <namespace key="8" case="first-letter">MediaWiki</namespace>
      <namespace key="9" case="first-letter">Discussion MediaWiki</namespace>
      <namespace key="10" case="first-letter">Modèle</namespace>
      <namespace key="11" case="first-letter">Discussion modèle</namespace>
      <namespace key="12" case="first-letter">Aide</namespace>
      <namespace key="13" case="first-letter">Discussion aide</namespace>
      <namespace key="14" case="first-letter">Catégorie</namespace>
      <namespace key="15" case="first-letter">Discussion catégorie</namespace>
      <namespace key="100" case="first-letter">Portail</namespace>
      <namespace key="101" case="first-letter">Discussion Portail</namespace>
      <namespace key="102" case="first-letter">Projet</namespace>
      <namespace key="103" case="first-letter">Discussion Projet</namespace>
      <namespace key="104" case="first-letter">Référence</namespace>
      <namespace key="105" case="first-letter">Discussion Référence</namespace>
      <namespace key="828" case="first-letter">Module</namespace>
      <namespace key="829" case="first-letter">Discussion module</namespace>
      <namespace key="2300" case="first-letter">Gadget</namespace>
      <namespace key="2301" case="first-letter">Gadget talk</namespace>
      <namespace key="2302" case="case-sensitive">Gadget definition</namespace>
      <namespace key="2303" case="case-sensitive">Gadget definition talk</namespace>
      <namespace key="2600" case="first-letter">Sujet</namespace>
    </namespaces>
  </siteinfo>
  <page>
    <title>Antoine Meillet</title>
    <ns>0</ns>
    <id>3</id>
    <revision>
      <id>123903866</id>
      <parentid>123513568</parentid>
      <timestamp>2016-03-02T15:50:55Z</timestamp>
      <contributor>
        <username>RobokoBot</username>
        <id>2090299</id>
      </contributor>
      <minor/>
      <comment>Ajout d\'une puce avant {{Autorité}} suite à la modification du modèle</comment>
      <model>wikitext</model>
      <format>text/x-wiki</format>
      <text id="123933672" bytes="9683" />
      <sha1>f9e9rj6s5eistpyimc4xqrtauni5uc3</sha1>
    </revision>
  </page>
  <page>
    <title>Algèbre linéaire</title>
    <ns>0</ns>
    <id>7</id>
    <revision>
      <id>123705494</id>
      <parentid>121738150</parentid>
      <timestamp>2016-02-25T16:21:28Z</timestamp>
      <contributor>
        <username>Anareth</username>
        <id>2426186</id>
      </contributor>
      <minor/>
      <comment>/* Histoire */ grammaire</comment>
      <model>wikitext</model>
      <format>text/x-wiki</format>
      <text id="123731807" bytes="17107" />
      <sha1>iewjt56i5p1bhxup95b9bp08r5u0t9u</sha1>
    </revision>
  </page>
  <page>
    <title>Algèbre générale</title>
    <ns>0</ns>
    <id>9</id>
    <revision>
      <id>116367254</id>
      <parentid>109545698</parentid>
      <timestamp>2015-06-28T08:36:29Z</timestamp>
      <contributor>
        <username>Prospaire</username>
        <id>2133855</id>
      </contributor>
      <comment>divers</comment>
      <model>wikitext</model>
      <format>text/x-wiki</format>
      <text id="116305869" bytes="1965" />
      <sha1>9smdmhiguf5lxyrfqxq2j6m66tq65y9</sha1>
    </revision>
  </page>'
G. Grothendieck
  • Thank you, this seems to work. I have one more question: I have an encoding issue with the page titles, which I access with the XPath //page/title. I fixed the issue in the examples I gave you. Do you have a fix to get the correct encoding in R? – Léo Joubert Mar 25 '16 at 14:41
  • The question only showed a partial page but the entire page likely has a `` or similar tag which will be picked up. – G. Grothendieck Mar 25 '16 at 15:42
  • It seems not, but can I add it with R? – Léo Joubert Mar 25 '16 at 17:10
  • There is an encoding= argument to xmlTreeParse which can be FALSE or an encoding such as "UTF-8" or some other encoding -- you will have to experiment to find the right encoding. This may help: http://stackoverflow.com/questions/4806823/how-to-detect-the-right-encoding-for-read-csv – G. Grothendieck Mar 25 '16 at 17:26
  • Thanks, I solved the problem using the encoding argument. – Léo Joubert Mar 29 '16 at 08:28
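
The encoding fix discussed in these comments might look roughly like the sketch below. It assumes the dump is UTF-8, which the thread does not confirm, and it uses a made-up one-page `fragment` instead of the real file. The accented letters are written as `\u` escapes so that the string is UTF-8 however the script itself is saved.

```r
library(XML)

# Hypothetical one-page fragment; "\u00e8" and "\u00e9" are escaped accented
# letters, so the string is UTF-8 regardless of the script file's encoding.
fragment <- "<page><title>Alg\u00e8bre lin\u00e9aire</title><id>7</id></page>"

# Assumption: the input is UTF-8. Passing encoding = "UTF-8" tells the parser
# how to decode the bytes instead of letting it fall back to a default.
doc <- htmlTreeParse(fragment, asText = TRUE, useInternalNodes = TRUE,
                     encoding = "UTF-8")

# Extract the titles; "//title" matches the element wherever the HTML parser
# placed it in the tree
titles <- xpathSApply(doc, "//title", xmlValue)
print(titles)
```

If the titles still come back garbled, a different value for `encoding` may be needed, as suggested above.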