
I’m trying to use the downloadable Wiktionary data dumps, which I have been parsing with Java's regex classes, specifically Pattern and Matcher, with fair success.

The word definition dumps, which are my main interest, are in raw wiki markup, which is neither HTML nor XML but its own format. There are many different elements, but the most difficult to deal with are the templates.

What I’ve come up against are specific templates that have positional fields as well as optional named fields, and the named ones can appear in any order. I have been able to come up with regular expressions that almost do the job, but they do not handle every instance I encounter, where fields are switched around or omitted.

I realize from this that I do not know how to specify regex groups when the order of occurrence is more complicated than a fixed sequence.

An example of one of these complicated templates is that of “term”, documented on the following page: http://en.wiktionary.org/wiki/Template:term

My best stab at the regex (omitting, for now, the extra escape characters needed to make the string Java compatible) is the following:

\{\{term\|(.+?)(?:\|(.*?))?(?:\|([\w, -]+?))?(?:\|lang=([\w-]+?))?(?:\|sc=(\w+?))?(?:\|tr=([\w, -]+?))?(?:\|pos=(\w+?))?(?:\|lit=([\w, -]+))?\}\}

This works for genuine examples of term templates encountered such as:

{{term|λόγος|logos|word|lang=grc}}
{{term|verbum|verbō|for the word|lang=la}}
{{term|*bʰer-||to carry|lang=ine-pro}}
{{term|alternative lifestyle|lang=en}}
{{term|שוין||already|lang=yi|tr=shoyn}}
{{term|Bögge||goblin, snot|lang=nds}}
{{term|as}}

But it fails to work properly for the following:

{{term|deus ex māchinā||device|pos=n|lit=god from a device|lang=la}}
{{term|ри̏ба||fish|tr=rȉba|sc=Cyrl|lang=sh}}
{{term|שוין|lang=yi|tr=shoyn}}
{{term|lang=en|vocational}}

There’s got to be a way of specifying that some groups are positional and some can appear in any order, instead of just optionally in a specific sequence. This should be, for instance, a common issue when processing HTML elements, whose attributes can also come in any order. I would very much appreciate any advice on how to write the regex to deal with this. Thanks so much! – Jeff.

Djedefrey
  • If there are two formats, then why not process the file twice. The first time get all format-a lines, the second, get all format-b lines. Why mash this much into one regex? – aliteralmind Apr 16 '14 at 03:03
  • It's a desire to learn more about regex on my part, not just an immediate practical data acquisition need. I am doing what you propose, right now, to get by. But it seems to me that the issue I'm asking about has an answer and I would like to learn how to do that in regex. – Djedefrey Apr 16 '14 at 03:14

1 Answer


Your regex matches every line, according to RegexBuddy, Java flavor, although I can't tell whether it's capturing exactly what you want.

It is extremely slow, however: Debuggex has been churning away on it for about ten minutes now, still with no response. This despite a pretty small set of input.

...Finally:

^\{\{term\|(.+?)(?:\|(.*?))?(?:\|([\w, -]+?))?(?:\|lang=([\w-]+?))?(?:\|sc=(\w+?))?(?:\|tr=([\w, -]+?))?(?:\|pos=(\w+?))?(?:\|lit=([\w, -]+))?\}\}$


It's actually not working on Debuggex. For some reason it's not anchoring to the line start and end, despite the m flag and the ^ and $ which I've added. They work okay in RegexBuddy.

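I'm thinking that this is just not a good problem for regex, or at least not for a reasonable single regex. Splitting each template's contents on the | is a much better way of handling this problem: a field that contains = is a named field, and the rest are positional, in the order they appear.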
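A minimal sketch of the splitting approach, in Java. The class and method names (`TermTemplateParser`, `Term`, `parse`) are made up for illustration. It assumes the template body contains no nested `{{...}}` templates and no `|` inside wiki links, and it treats any field of the form `word=value` as a named field, so it is a starting point rather than a full wiki-markup parser. The pattern is not anchored, so it finds templates embedded anywhere in a line.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class TermTemplateParser {

    // Finds {{term|...}} anywhere in the input (no ^ or $ anchors).
    // Assumption: the body has no braces, so nested templates are not handled.
    private static final Pattern TERM =
            Pattern.compile("\\{\\{term\\|([^{}]*)\\}\\}");

    // A named field is key=value, where key is an ASCII word such as lang, sc, tr.
    private static final Pattern NAMED =
            Pattern.compile("([A-Za-z][A-Za-z0-9_]*)=(.*)", Pattern.DOTALL);

    /** Fields of one template: positional ones in order, named ones by key. */
    public static final class Term {
        public final List<String> positional = new ArrayList<>();
        public final Map<String, String> named = new LinkedHashMap<>();
    }

    /** Parses every term template found in the given text. */
    public static List<Term> parse(String text) {
        List<Term> result = new ArrayList<>();
        Matcher m = TERM.matcher(text);
        while (m.find()) {
            Term term = new Term();
            // Limit -1 keeps empty fields, so "a||b" yields "a", "", "b".
            for (String field : m.group(1).split("\\|", -1)) {
                Matcher named = NAMED.matcher(field);
                if (named.matches()) {
                    term.named.put(named.group(1), named.group(2));
                } else {
                    term.positional.add(field);
                }
            }
            result.add(term);
        }
        return result;
    }

    public static void main(String[] args) {
        String[] samples = {
            "{{term|deus ex māchinā||device|pos=n|lit=god from a device|lang=la}}",
            "{{term|שוין|lang=yi|tr=shoyn}}",
            "{{term|lang=en|vocational}}"
        };
        for (String sample : samples) {
            for (Term t : parse(sample)) {
                System.out.println(t.positional + " " + t.named);
            }
        }
    }
}
```

Because named fields are pulled out before positions are counted, `{{term|lang=en|vocational}}` still yields `vocational` as the first positional field, which is the case a fixed-sequence regex cannot express.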

In addition to discouraging you from using regex, I'm also letting you know about the Stack Overflow Regular Expressions FAQ :)

aliteralmind
  • Thanks for your response, esp. the regex faq! But I’m realizing that I need to point out that this template example is not a line/record, but just one of many markup elements that can be embedded in a one line definition record. The code I have so far just to process a single line is 700 lines long and includes many, many regex patterns! There are elements nested in other elements too. Therefore, for one thing, using anchors won’t work. – Djedefrey Apr 16 '14 at 04:54
  • Also, when I say certain template examples are not getting parsed right, I didn't mean that the templates aren't being identified, but the capture groups don't always work to isolate one of these | delimited fields within the template. – Djedefrey Apr 16 '14 at 04:57
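On the narrower regex question, one possible way to make the named fields order-independent in a single pattern is to capture each one inside its own lookahead, so every search starts from the same point just after `{{term`. The sketch below is a hedged illustration, not a complete solution: the names (`TermLookaheadDemo`, `optionalNamed`, `namedFields`) are hypothetical, only `lang`, `tr` and `sc` are covered, positional fields are not captured, and nested templates are assumed absent.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class TermLookaheadDemo {

    // Builds a lookahead that always succeeds. If "|key=" occurs before the
    // closing braces, its value is captured; otherwise the group stays null.
    public static String optionalNamed(String key) {
        return "(?=(?:[^{}]*?\\|" + key + "=([^|{}]*))?)";
    }

    // No anchors, so the template may sit anywhere in the line.
    public static final Pattern TERM = Pattern.compile(
            "\\{\\{term(?=\\|)"
            + optionalNamed("lang")   // group 1
            + optionalNamed("tr")     // group 2
            + optionalNamed("sc")     // group 3
            + "[^{}]*\\}\\}");

    /**
     * Returns {lang, tr, sc} for the first term template in the text,
     * or null if there is none. Fields that are absent are null.
     */
    public static String[] namedFields(String text) {
        Matcher m = TERM.matcher(text);
        if (!m.find()) {
            return null;
        }
        return new String[] { m.group(1), m.group(2), m.group(3) };
    }

    public static void main(String[] args) {
        String[] fields = namedFields("{{term|ри̏ба||fish|tr=rȉba|sc=Cyrl|lang=sh}}");
        System.out.println("lang=" + fields[0] + " tr=" + fields[1] + " sc=" + fields[2]);
    }
}
```

Each lookahead consumes nothing, so the order in which the keys are listed in the pattern does not have to match the order in which they appear in the template. Positional fields are still easier to recover by splitting on `|`.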