I created a regular expression to match tokens in a German text, which is of type string.
My regular expression works as expected on regex101.com. Here is a link to my regex with an example sentence: My regex + example on regex101.com
So I implemented it in Python 2.7 like this:
# -*- coding: utf-8 -*-
import re

GERMAN_TOKENIZER = r'''(?x) # set flag to allow verbose regex
([A-ZÄÖÜ]\.)+ # abbreviations including ÄÖÜ
|\d+([.,]\d+)?([€$%])? # numbers, allowing commas as separators and € as currency
|[\wäöü]+ # matches normal words
|\.\.\. # ellipsis
|[][.,;\"'?():-_'!] # matches special characters including !
'''

def tokenize_german_text(text):
    '''
    Takes a text of type string and
    tokenizes the text
    '''
    matchObject = re.findall(GERMAN_TOKENIZER, text)
    pass

tokenize_german_text(u'Das ist ein Deutscher Text! Er enthält auch Währungen, 10€')
Result:
When debugging this, I found that matchObject is only a list containing 11 entries with empty strings. Why is it not working as expected, and how can I fix this?
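A likely cause, offered as a sketch rather than a definitive answer: when a pattern contains capturing groups, re.findall returns the contents of those groups instead of the whole match, so alternatives that match without filling any group show up as empty strings. The version below assumes Python 3 (where \w already matches ä, ö, ü in str patterns), turns every group into a non-capturing (?:...) group, and rewrites the special-character class so that the hyphen sits at the end. In the original class, `:-_` is read as a character range, which is probably not intended. On Python 2.7 the pattern and the text would presumably both need to be unicode strings, with re.UNICODE set.

```python
import re

# Small demonstration of the findall behavior with and without a capturing group.
with_group = re.findall(r'\d+([.,]\d+)?', '3,14')       # returns the group only
without_group = re.findall(r'\d+(?:[.,]\d+)?', '3,14')  # returns the whole match

# Assumed rewrite of the tokenizer: same alternatives, but non-capturing groups.
GERMAN_TOKENIZER = re.compile(r'''
    (?:[A-ZÄÖÜ]\.)+            # abbreviations, including ÄÖÜ
  | \d+(?:[.,]\d+)?[€$%]?      # numbers with optional decimals and €, $ or %
  | \w+                        # normal words (\w covers umlauts for str patterns)
  | \.\.\.                     # ellipsis
  | [\]\[.,;"'?():_!-]         # special characters; hyphen last so it is literal
''', re.VERBOSE | re.UNICODE)


def tokenize_german_text(text):
    """Return the list of tokens found in text."""
    return GERMAN_TOKENIZER.findall(text)


tokens = tokenize_german_text('Das ist ein Deutscher Text! Er enthält auch Währungen, 10€')
print(with_group)     # [',14']
print(without_group)  # ['3,14']
print(tokens)
```

If the groups are needed for something else, re.finditer with match.group(0) should give the full matches as well, without changing the pattern.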