
I am trying to clean up some Wikipedia text that I downloaded for a project. My problem is that the text is filled with "noise" that I would like to remove, such as data tables.

This is a snippet of the text I am trying to parse:

test = 
South Midland \'arm\' and \'barb\' rhyming with \'form\' and \'orb.\' Unique 
words in Alabama English include: redworm (earthworm), peckerwood 
(woodpecker), snake doctor and snake feeder (dragonfly), tow sack (burlap 
bag), plum peach (clingstone), French harp (harmonica), and dog irons 
(andirons).<ref name="city-data.com"/>',
 '',
 '{|class="wikitable sortable" style="margin-left:1em; float:center"',
 "|+ '''Top 10 Non-English Languages Spoken in Alabama'''",
 '|-',
 '! Language !! Percentage of population<br /><small>({{as of|2010|lc=y}}) 
</small><ref>{{cite web|url=http://www.city-data.com/states/Alabama- 
Languages.html"|title=Alabama – Languages|work=city-data.com|accessdate=July 
21, 2015}}</ref>',
 '|-',
 '| Spanish|| 2.2%',
 '|-',
 '| German || 0.4%',
 '|-',
 '| French (incl. Patois, Cajun) || 0.3%',
 '|-',
 '| Chinese, [[Vietnamese language|Vietnamese]], [[Korean language|Korean]], 
[[Arabic language|Arabic]], [[African languages]], Japanese, and Italian 
(tied)|| 0.1%',
 '|}',
 '',

I want to remove the data tables from the text. They are delimited by {| and |}.

I have researched and tried to use regular expressions and have come up with the following:

re.sub(r'\{|(.*?)\|}', '', test)

But this seems to just remove the delimiters themselves and not everything in between.

Can someone help me learn here :)

M.Van
  • You did not escape the first `|`, and you need to use `re.DOTALL` if the matches span across multiple lines. Use `re.sub(r'\{\|.*?\|}', '', test, flags=re.DOTALL)` – Wiktor Stribiżew Nov 29 '18 at 08:47
  • Thank you very much, this works perfectly! Did not know about the flags=re.DOTALL. – M.Van Nov 29 '18 at 08:50
  • This is outside the scope of your specific question, but have you checked existing tools for processing the wiki dumps? I don't have much experience, but https://github.com/attardi/wikiextractor seems to have a lot of export and processing options. – sbat Nov 29 '18 at 08:52
  • I have searched but haven't been able to find anything that seems relevant. The link you posted looks promising, though, thank you very much :) – M.Van Nov 29 '18 at 08:54
  • Just to clarify: by default the dot matches any character except a newline; the flag makes it match newlines too. You could also have used `{\|(.*\n.*)*\|}` for the regex. It reads: begin with a brace and pipe, then anything as many times as possible, followed by a newline and again anything, and end with a pipe and brace. – Borisu Nov 29 '18 at 09:00
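
The fix suggested in the comments can be sketched as a small, self-contained script. The `sample` string and the names `broken`, `TABLE_RE`, and `cleaned` are made up for illustration and are not from the question. The sketch assumes the text is one string and that tables are not nested; with nested tables the lazy match would stop at the first `|}` and leave part of the outer table behind.

```python
import re

# Hypothetical sample: wikitext with one table between two lines of prose.
sample = (
    "Prose before the table.\n"
    '{|class="wikitable sortable"\n'
    "|-\n"
    "| Spanish|| 2.2%\n"
    "|-\n"
    "| German || 0.4%\n"
    "|}\n"
    "Prose after the table."
)

# Original attempt: the unescaped | turns the pattern into an alternation,
# "\{" OR "(.*?)\|}", and the dot cannot cross newlines, so only the
# delimiters themselves are removed.
broken = re.sub(r'\{|(.*?)\|}', '', sample)

# Corrected pattern from the comments: escape the first pipe and pass
# re.DOTALL so that the dot also matches newlines. The lazy .*? stops at
# the first closing |} instead of running to the last one in the text.
TABLE_RE = re.compile(r'\{\|.*?\|}', flags=re.DOTALL)
cleaned = TABLE_RE.sub('', sample)

print(cleaned)
```

With the original pattern the table rows survive in `broken`; with the corrected one only the two prose lines remain in `cleaned`, separated by the blank line left where the table was.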
