Python 2.7 Regex Tokenizer Implementation Not Working

Question

I created a regular expression to match tokens in a German text of type string.

My regular expression works as expected on regex101.com. Here is a link to my regex with an example sentence: My regex + example on regex101.com

So I implemented it in Python 2.7 like this:

import re

GERMAN_TOKENIZER = r'''(?x) # set flag to allow verbose regex
([A-ZÄÖÜ]\.)+  # abbreviations including ÄÖÜ
|\d+([.,]\d+)?([€$%])? # numbers, allowing commas as separators and € as currency
|[\wäöü]+ # matches normal words
|\.\.\. # ellipsis
|[][.,;\"'?():-_'!] # matches special characters including !
'''

def tokenize_german_text(text):
    '''
        Takes a text of type string and 
        tokenizes the text
    '''
    matchObject = re.findall(GERMAN_TOKENIZER, text)
    pass

tokenize_german_text(u'Das ist ein Deutscher Text! Er enthält auch Währungen, 10€')

Result:

When debugging this, I found that matchObject is just a list of 11 entries containing only empty strings. Why is it not working as expected, and how can I fix it?

Additional note: You might want to compile your regex with the `re.UNICODE` option in order to allow `\w` to match non-ASCII letters - after all, there are German words that contain accents other than umlauts, and you're missing those right now. — Tim Pietzcker, Jun 24 '17 at 14:00
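A minimal sketch of what the comment suggests, assuming a made-up sample string (not from the question) and a deliberately simple `\w+` pattern. The non-ASCII letters are written as escapes so the snippet does not depend on the source file encoding. On Python 2.7, leaving out `re.UNICODE` would make `\w` stop at "é" and "ß"; on Python 3 the flag is already the default for str patterns.

```python
import re

# Hypothetical sample text: \xe9 is "é" and \xdf is "ß".
sample = u'Caf\xe9 in der Stra\xdfe'

# re.UNICODE makes \w follow the Unicode character database,
# so accented letters and "ß" count as word characters.
word_pattern = re.compile(r'\w+', re.UNICODE)

words = word_pattern.findall(sample)
print(words)
```

The same flag can be passed when compiling a larger verbose pattern such as the tokenizer; whether the explicit `äöü` in the character class is then still needed depends on the input being a Unicode string.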

Tim Pietzcker · Accepted Answer · 2017-06-24T14:02:44.023

re.findall() returns only the contents of the capturing groups if the regex has any (with more than one group, each entry is a tuple of the group contents). Only when there are no capturing groups in your regex does it return each complete match.
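This behaviour can be seen in a short sketch. The sample text and the reduced number pattern are illustrative assumptions, not the tokenizer from the question.

```python
import re

# Hypothetical sample text containing one decimal number and one integer.
text = 'Preis: 3,50 oder 10'

# One capturing group: findall returns only what the group captured.
# For "10" the optional group does not participate, so the entry is ''.
with_group = re.findall(r'\d+([.,]\d+)?', text)

# Same pattern with a non-capturing group: findall returns whole matches.
without_group = re.findall(r'\d+(?:[.,]\d+)?', text)

# Two capturing groups: each entry is a tuple with one string per group.
two_groups = re.findall(r'(\d+)([.,]\d+)?', text)

print(with_group)     # [',50', '']
print(without_group)  # ['3,50', '10']
print(two_groups)     # [('3', ',50'), ('10', '')]
```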

So your regex matches several times, but every time the match is one where no capturing group is participating, which is why you only get empty strings. Turn the capturing groups into non-capturing ones, (?:...), and you'll see results. Also, place the - at the end of the character class unless you actually want to match the range of characters between : and _ (which does not include the - itself):

GERMAN_TOKENIZER = r'''(?x) # set flag to allow verbose regex
(?:[A-ZÄÖÜ]\.)+  # abbreviations including ÄÖÜ
|\d+(?:[.,]\d+)?[€$%]? # numbers, allowing commas as separators and € as currency
|[\wäöü]+ # matches normal words
|\.\.\. # ellipsis
|[][.,;\"'?():_'!-] # matches special characters including !
'''

Result:

['Das', 'ist', 'ein', 'Deutscher', 'Text', '!', 'Er', 'enthält', 'auch', 'Währungen', ',', '10€']
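The remark about the position of the hyphen in the character class can be checked with a small sketch. The sample string and the two reduced character classes are assumptions chosen for illustration.

```python
import re

# Hypothetical sample with an uppercase range, a hyphen, an underscore and a colon.
sample = 'A-Z_a:b'

# ":-_" inside a class is a range from ':' (0x3A) to '_' (0x5F).
# It covers uppercase letters but not '-' (0x2D) or lowercase letters.
as_range = re.findall(r'[:-_]', sample)

# With '-' placed last, the class holds exactly ':', '_' and '-'.
as_literal = re.findall(r'[:_-]', sample)

print(as_range)    # ['A', 'Z', '_', ':']
print(as_literal)  # ['-', '_', ':']
```

In the original tokenizer this range would also have treated single uppercase letters as "special characters" if the word alternative had not matched them first.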
