I’m trying to use the Wiktionary data dump downloads, which I have been processing with Java and its regex classes (specifically Pattern and Matcher), with fair success.
The word definition dumps, which are my main interest, are in raw wiki markup, which is neither HTML nor XML but its own format. There are many different elements, but the most difficult to deal with are the templates.
What I’ve come up against are templates that have positional fields as well as optional named ones (lang=, tr=, and so on), which can appear in any order. I have come up with regular expressions that almost do the job, but they do not handle every instance I encounter, where fields are switched around or omitted.
From this I realize that I do not know how to designate regex group positions when the order of occurrence is more complicated than a fixed sequence.
An example of one of these complicated templates is “term”, documented on the following page: http://en.wiktionary.org/wiki/Template:term
My best stab at the regex (omitting, for now, the extra escape characters needed to make the string Java compatible) is the following:
\{\{term\|(.+?)(?:\|(.*?))?(?:\|([\w, -]+?))?(?:\|lang=([\w-]+?))?(?:\|sc=(\w+?))?(?:\|tr=([\w, -]+?))?(?:\|pos=(\w+?))?(?:\|lit=([\w, -]+))?\}\}
This works for genuine examples of term templates encountered such as:
{{term|λόγος|logos|word|lang=grc}}
{{term|verbum|verbō|for the word|lang=la}}
{{term|*bʰer-||to carry|lang=ine-pro}}
{{term|alternative lifestyle|lang=en}}
{{term|שוין||already|lang=yi|tr=shoyn}}
{{term|Bögge||goblin, snot|lang=nds}}
{{term|as}}
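For completeness, here is roughly how I am running the pattern in Java, with every backslash doubled for the string literal. This is only a minimal sketch: the class name TermRegexDemo and the helper groups() are made up for illustration, and the Greek sample is written with Unicode escapes purely to keep the source file ASCII.

```java
import java.util.Arrays;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class TermRegexDemo {
    // Same pattern as above; each backslash is doubled for the Java string literal.
    public static final Pattern TERM = Pattern.compile(
        "\\{\\{term\\|(.+?)(?:\\|(.*?))?(?:\\|([\\w, -]+?))?"
        + "(?:\\|lang=([\\w-]+?))?(?:\\|sc=(\\w+?))?(?:\\|tr=([\\w, -]+?))?"
        + "(?:\\|pos=(\\w+?))?(?:\\|lit=([\\w, -]+))?\\}\\}");

    // Returns capture groups 1..8 of the first match (null entries for groups
    // that did not participate), or null if the input contains no match.
    public static String[] groups(String input) {
        Matcher m = TERM.matcher(input);
        if (!m.find()) {
            return null;
        }
        String[] out = new String[m.groupCount()];
        for (int i = 1; i <= m.groupCount(); i++) {
            out[i - 1] = m.group(i);
        }
        return out;
    }

    public static void main(String[] args) {
        // The Greek word from the first example, written with Unicode escapes.
        String sample = "{{term|\u03bb\u03cc\u03b3\u03bf\u03c2|logos|word|lang=grc}}";
        System.out.println(Arrays.toString(groups(sample)));
    }
}
```

Note that \w in Java matches only ASCII word characters unless Pattern.UNICODE_CHARACTER_CLASS is set, which matters for fields such as tr= that may contain accented letters.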
But it fails to work properly for the following:
{{term|deus ex māchinā||device|pos=n|lit=god from a device|lang=la}}
{{term|ри̏ба||fish|tr=rȉba|sc=Cyrl|lang=sh}}
{{term|שוין|lang=yi|tr=shoyn}}
{{term|lang=en|vocational}}
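A fallback I can see, if a single regex cannot express this, is a two-stage approach: use one simple regex to find the template body, then split the body on | and classify each field as named (name=value) or positional. Here is a minimal sketch of that idea. The class TermTemplateSketch, the Fields holder and parseAll() are hypothetical names of my own, and the sketch assumes the template body contains no nested templates and no wiki links with pipes inside them.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class TermTemplateSketch {
    // Stage 1: capture everything between "{{term|" and the closing "}}".
    // Assumption: the body contains no '{' or '}' (no nested templates).
    private static final Pattern TERM =
        Pattern.compile("\\{\\{term\\|([^{}]*)\\}\\}");

    // Stage 2: a named field has the form name=value, with an ASCII name.
    private static final Pattern NAMED =
        Pattern.compile("([A-Za-z][A-Za-z0-9]*)=(.*)", Pattern.DOTALL);

    /** Fields of one template: positional values in order, named values by key. */
    public static final class Fields {
        public final List<String> positional = new ArrayList<>();
        public final Map<String, String> named = new LinkedHashMap<>();
    }

    /** Parses every {{term|...}} occurrence found in the given text. */
    public static List<Fields> parseAll(String text) {
        List<Fields> results = new ArrayList<>();
        Matcher m = TERM.matcher(text);
        while (m.find()) {
            Fields f = new Fields();
            // A limit of -1 keeps empty fields, so "a||b" yields "a", "", "b".
            for (String field : m.group(1).split("\\|", -1)) {
                Matcher n = NAMED.matcher(field);
                if (n.matches()) {
                    f.named.put(n.group(1), n.group(2));
                } else {
                    f.positional.add(field);
                }
            }
            results.add(f);
        }
        return results;
    }

    public static void main(String[] args) {
        for (Fields f : parseAll("{{term|lang=en|vocational}}")) {
            System.out.println(f.positional + " " + f.named);
        }
    }
}
```

With this, the position of lang=, tr=, sc= and the rest no longer matters, and the positional fields keep their relative order even when an empty one sits between them. It sidesteps the regex question rather than answering it, though.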
There’s got to be a way of specifying that some groups are positional and some can appear in any order, instead of just optionally in a specific sequence. I imagine this is a common issue, for instance when processing HTML elements, whose attributes can also appear in any order. I would very much appreciate any advice on how to write the regex to deal with this kind of ordering. Thanks so much! – Jeff.