12

I am trying to find all possible word/tag pairs or other nested combinations using Python and its regular expressions.

import re

sent = '(NP (NNP Hoi) (NN Hallo) (NN Hey) (NNP (NN Ciao) (NN Adios)))'

def checkBinary(sentence):
    n = re.findall(r"\([A-Za-z-0-9\s\)\(]*\)", sentence)
    print(n)

checkBinary(sent)

Output:
['(NP (NNP Hoi) (NN Hallo) (NN Hey) (NNP (NN Ciao) (NN Adios)))']

looking for:

['(NP (NNP Hoi) (NN Hallo) (NN Hey) (NNP (NN Ciao) (NN Adios)))', 
 '(NNP Hoi)', 
 '(NN Hallo)',
 '(NN Hey)', 
 '(NNP (NN Ciao) (NN Adios))',
 '(NN Ciao)',
 '(NN Adios)']

I think the regex could find the nested parenthesized word/tag pairs as well, but it doesn't return them. How should I do this?

zmo
Wolf Vos

2 Answers

31

It's actually not possible to do this with regular expressions in the strict sense. A regular expression describes a language defined by a regular grammar, which can be recognized by a finite automaton (deterministic or not) in which matching progress is represented by states. To match nested parentheses of arbitrary depth, you would need to keep track of an unbounded number of open parentheses, which would require an automaton with an infinite number of states.

To cope with that, we use what's called a push-down automaton: a finite automaton extended with a stack, which recognizes the languages defined by context-free grammars.
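As a rough illustration of the push-down idea, here is a minimal sketch that uses a plain Python list as the stack. The function name `nested_groups` is made up for this example, and the code assumes the only delimiters that matter are `(` and `)`; it is one way to get the output asked for in the question, not the only one.

```python
def nested_groups(text):
    """Return every balanced parenthesized substring of text.

    Hypothetical helper: the stack holds the index of each '(' that
    has not been closed yet, which is exactly the memory a finite
    automaton lacks.
    """
    stack = []   # indices of currently open '('
    spans = []   # (start, end) of each completed group
    for i, ch in enumerate(text):
        if ch == '(':
            stack.append(i)
        elif ch == ')':
            if not stack:
                raise ValueError("unbalanced ')' at index %d" % i)
            spans.append((stack.pop(), i + 1))
    if stack:
        raise ValueError("unbalanced '(' at index %d" % stack[-1])
    # Sort by opening position so outer groups come before inner ones.
    spans.sort()
    return [text[start:end] for start, end in spans]


sent = '(NP (NNP Hoi) (NN Hallo) (NN Hey) (NNP (NN Ciao) (NN Adios)))'
for group in nested_groups(sent):
    print(group)
```

Sorting the spans by their opening index lists each group in the order its `(` appears, so the outermost group comes first.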

Chomsky's hierarchy

So if your regex does not match nested parentheses, it's because it expresses the following automaton, which consumes your whole input as a single match and so never reports the inner groups:

Regular expression visualization

As a reference, please have a look at MIT's courses on the topic:

So one way to parse your string efficiently is to build a grammar for nested parentheses (pip install pyparsing first):

>>> import pyparsing
>>> strings = pyparsing.Word(pyparsing.alphanums)
>>> parens  = pyparsing.nestedExpr( '(', ')', content=strings)
>>> parens.parseString('(NP (NNP Hoi) (NN Hallo) (NN Hey) (NNP (NN Ciao) (NN Adios)))').asList()
[['NP', ['NNP', 'Hoi'], ['NN', 'Hallo'], ['NN', 'Hey'], ['NNP', ['NN', 'Ciao'], ['NN', 'Adios']]]]
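The nested list above is not yet in the flat form the question asks for. A minimal sketch of how it could be converted, assuming the list shape shown in the output (strings for words and tags, lists for groups); `all_subtrees` and `to_bracketed` are hypothetical helper names, and the list literal is copied from the output so the snippet runs without pyparsing installed:

```python
def all_subtrees(node):
    """Yield every nested list in pre-order (outer groups first)."""
    if isinstance(node, list):
        yield node
        for child in node:
            yield from all_subtrees(child)


def to_bracketed(node):
    """Rebuild the '(TAG ...)' string for a nested list or a plain word."""
    if isinstance(node, str):
        return node
    return '(' + ' '.join(to_bracketed(child) for child in node) + ')'


# Same structure as the .asList() result shown above.
parsed = [['NP', ['NNP', 'Hoi'], ['NN', 'Hallo'], ['NN', 'Hey'],
           ['NNP', ['NN', 'Ciao'], ['NN', 'Adios']]]]

# parsed[0] is the outermost group; the wrapping list comes from asList().
groups = [to_bracketed(tree) for tree in all_subtrees(parsed[0])]
for group in groups:
    print(group)
```

Note that rebuilding the strings this way normalizes whitespace to single spaces, which may or may not matter for your data.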

N.B.: a few regular expression engines do implement nested parenthesis matching using a push-down mechanism. The default Python re engine is not one of them, but an alternative engine called regex (pip install regex) supports recursive matching (which lets it handle context-free constructs like this one); cf. this code snippet:

>>> import regex
>>> res = regex.search(r'(?<rec>\((?:[^()]++|(?&rec))*\))', '(NP (NNP Hoi) (NN Hallo) (NN Hey) (NNP (NN Ciao) (NN Adios)))')
>>> res.captures('rec')
['(NNP Hoi)', '(NN Hallo)', '(NN Hey)', '(NN Ciao)', '(NN Adios)', '(NNP (NN Ciao) (NN Adios))', '(NP (NNP Hoi) (NN Hallo) (NN Hey) (NNP (NN Ciao) (NN Adios)))']
Community
zmo
  • 4
    CS at its essence. +1 – PepperoniPizza May 14 '14 at 12:34
  • 5
    Oh modern regex [could match](http://regex101.com/r/dS9kM8) this kind of data. Read about [recursive patterns](http://stackoverflow.com/questions/17845014/what-does-the-regex-mean/17845034#17845034) and/or [balancing groups](http://stackoverflow.com/questions/17003799/what-are-regular-expression-balancing-groups/17004406#17004406). [Reference](http://stackoverflow.com/a/22944075) – HamZa May 14 '14 at 13:08
  • 7
    indeed, and I even suggest a recursive pattern solution at the end. Though, per definition, those are not ***regular*** expressions anymore. – zmo May 14 '14 at 13:09
  • Whoa, I stopped reading before reaching the end. I must check this `regex` module. +1 – HamZa May 14 '14 at 13:12
  • @zmo Although you are right that the regular expressions used in modern languages are similar to regular expressions in language theory, you are wrong to assume they are exactly the same. Some regular expressions in modern languages cannot be represented by a regular grammar. See my answer. – Farhad Alizadeh Noori May 14 '14 at 13:31
  • 3
    @FarhadAliNoo Per definition of [regular expressions](http://en.wikipedia.org/wiki/Regular_expression#Formal_definition) in formal theory, a regular expression is implementing a regular grammar. The engines implementing non-regular expressions should take a different name, like *nregex* (for *non-regular expressions*) or *cfex* (*context-free expressions*)… Even though those abilities have been implemented in *regex* engines, **calling them *regex* is like calling *a plane* "*a car that can fly*"**. So maybe someone should make an article on HN to complain about that :-) – zmo May 14 '14 at 14:27
  • @zmo Yes. I agree. The fact that we are still calling those patterns regular expressions and they have not really been regular expressions for years is very misleading. – Farhad Alizadeh Noori May 14 '14 at 14:30
  • 2
    let's blame the Perl people! :-D – zmo May 14 '14 at 14:36
  • Thanks guys, you've been a great help! – Wolf Vos May 14 '14 at 15:42
2

Regular expressions used in modern languages DO NOT always represent regular languages. zmo is right in saying that regular languages in Language Theory are represented by finite state automata, but the regular expressions used in modern languages, with backtracking features such as backreferences to capturing groups, CANNOT always be represented by the FSAs known in Language Theory. How can you represent a pattern like (\w+)\1 with a DFA or even an NFA?
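To make the backreference example concrete, here is a small sketch using Python's built-in `re` module. The helper name `is_doubled` and the sample strings are made up for illustration. The pattern accepts exactly the strings made of some word repeated twice, and that set of strings is not a regular language.

```python
import re

# (\w+) captures a run of word characters; \1 must then repeat it exactly.
DOUBLED = re.compile(r'(\w+)\1')


def is_doubled(text):
    """Return True if text is some word immediately followed by itself."""
    return DOUBLED.fullmatch(text) is not None


print(is_doubled('hoihoi'))   # 'hoi' followed by 'hoi'
print(is_doubled('hoihey'))   # second half differs from the first
```

The engine has to remember the whole captured word to compare it later, which is memory a finite automaton does not have for words of unbounded length.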

The regular expression you are looking for could be something like this (it only matches up to two levels of nesting):

(?=(\((?:[^\)\(]*\([^\)]*\)|[^\)\(])*?\)))

I tested this on http://regexhero.net/tester/

The matches are in the captured groups:

1: (NP (NNP Hoi) (NN Hallo) (NN Hey) (NNP (NN Ciao) (NN Adios))

1: (NNP Hoi)

1: (NN Hallo)

1: (NN Hey)

1: (NNP (NN Ciao) (NN Adios))

1: (NN Ciao)

1: (NN Adios)
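The pattern above was tested on regexhero, but it uses only a lookahead and a capturing group, so it should also run under Python's built-in `re`. A minimal sketch, assuming `re.findall` is used to collect the captured group at each position (the variable names are illustrative):

```python
import re

sent = '(NP (NNP Hoi) (NN Hallo) (NN Hey) (NNP (NN Ciao) (NN Adios)))'

# The lookahead (?=...) consumes no characters, so a match is attempted
# at every position and overlapping groups can all be captured.
TWO_LEVELS = re.compile(r'(?=(\((?:[^\)\(]*\([^\)]*\)|[^\)\(])*?\)))')

# With one capturing group, findall returns that group for each match.
matches = TWO_LEVELS.findall(sent)
for m in matches:
    print(m)

# The pattern only understands two levels of nesting, so the outermost
# group (three levels deep here) comes back without its final ')'.
print(matches[0] == sent)
```

This shows the limitation directly: a fixed pattern of this kind handles only the nesting depth it was written for.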

Farhad Alizadeh Noori
  • 3
    I believe @zmo talked about modern regular expressions in the ending `NB` part (and gave an example using recursion). Also be careful, your expression doesn't go deeper than two levels of nesting: your first match is missing the closing parenthesis. – Robin May 14 '14 at 13:37
  • Oh, you are right! Yes, that regex is only good for two levels of nesting. – Farhad Alizadeh Noori May 14 '14 at 13:43
  • I encourage everyone to read about balancing groups and recursive patterns Hamza mentioned in zmo's post's comment section. A very good read indeed. – Farhad Alizadeh Noori May 14 '14 at 14:17
  • Farhad, thank you as well for the help! I think I'll go with the other answer, but thanks for the quick response! – Wolf Vos May 14 '14 at 15:47