I am trying to clean up some Wikipedia text that I downloaded for a project. My problem is that the text is full of "noise", such as data tables, that I would like to remove.
This is a snippet of the text I am trying to parse:
test =
South Midland \'arm\' and \'barb\' rhyming with \'form\' and \'orb.\' Unique
words in Alabama English include: redworm (earthworm), peckerwood
(woodpecker), snake doctor and snake feeder (dragonfly), tow sack (burlap
bag), plum peach (clingstone), French harp (harmonica), and dog irons
(andirons).<ref name="city-data.com"/>',
'',
'{|class="wikitable sortable" style="margin-left:1em; float:center"',
"|+ '''Top 10 Non-English Languages Spoken in Alabama'''",
'|-',
'! Language !! Percentage of population<br /><small>({{as of|2010|lc=y}})
</small><ref>{{cite web|url=http://www.city-data.com/states/Alabama-
Languages.html"|title=Alabama – Languages|work=city-data.com|accessdate=July
21, 2015}}</ref>',
'|-',
'| Spanish|| 2.2%',
'|-',
'| German || 0.4%',
'|-',
'| French (incl. Patois, Cajun) || 0.3%',
'|-',
'| Chinese, [[Vietnamese language|Vietnamese]], [[Korean language|Korean]],
[[Arabic language|Arabic]], [[African languages]], Japanese, and Italian
(tied)|| 0.1%',
'|}',
'',
I want to remove the data tables from the text. They are delimited by {| and |}.
I have researched regular expressions and come up with the following:
re.sub(r'\{|(.*?)\|}', '', test)
But this seems to remove only the delimiters themselves, not everything in between.
Can someone help me understand what I am doing wrong? :)
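
A likely cause, offered as a sketch rather than a definitive fix: in the pattern above the `|` right after `\{` is not escaped, so the regex engine reads it as alternation ("match `{` OR match `(.*?)\|}`") instead of a literal pipe. In addition, `.` does not match newlines unless `re.DOTALL` is set, so a match cannot span the lines of a table. The example below assumes the text is held in one string (if it is a list of lines, join them with `"\n".join(...)` first). The names `sample`, `TABLE_RE` and `strip_tables` are made up for illustration.

```python
import re

# Hypothetical sample standing in for the downloaded article text.
sample = (
    "dog irons (andirons).\n"
    "\n"
    '{|class="wikitable sortable"\n'
    "|-\n"
    "| Spanish|| 2.2%\n"
    "|}\n"
    "\n"
    "Text after the table.\n"
)

# \{\| matches the literal opening "{|" and \|\} the literal closing "|}".
# .*? is non-greedy, so each match stops at the first "|}" it reaches.
# re.DOTALL lets "." match newlines, so one match can cover a whole table.
TABLE_RE = re.compile(r"\{\|.*?\|\}", re.DOTALL)


def strip_tables(text: str) -> str:
    """Remove every non-nested {| ... |} block from text."""
    return TABLE_RE.sub("", text)


cleaned = strip_tables(sample)
print(cleaned)
```

Because the match is non-greedy, a table nested inside another table would be cut off at the inner `|}`. If nested tables occur in the data, a dedicated wikitext parser is probably a safer choice than a single regular expression.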