
Consider the string

cos(t(2))+t(51)

Using a regular expression, I'd like to match cos(t(2)), t(2) and t(51). The general pattern this fits is intended to be something like

variable or function name + opening_parenthesis + contents + closing_parenthesis,

where contents can be any expression that has an equal number of opening and closing parentheses.

I'm using [a-zA-Z]+\([\W\w]*\), which returns cos(t(2))+t(51) (the whole string, because [\W\w]* is greedy and runs to the last closing parenthesis). That, of course, is not the desired result.

Any ideas on how to achieve this using regex? I'm particularly stuck at this "equal number of opening and closing parentheses".

zx81
niels
  • Short answer: this is not possible with Regex. – Joeytje50 May 25 '14 at 20:20
  • Slightly longer answer: you'll need to use a dedicated parser if you want to deal with arbitrary nesting. – Oliver Charlesworth May 25 '14 at 20:20
  • Regular expression matching cannot be used to count an arbitrary number of matching expressions (i.e. balanced expressions). The reason for this is based in automata (language theory) and it basically states that the language you're trying to match (i.e. balanced parens) cannot be matched because it is not "regular". You can use a stack/parser though. – SystemFun May 25 '14 at 20:21
  • Some dialects (e.g. php/pcre) support recursive patterns which are able to match nested parens. What's your platform? – georg May 25 '14 at 20:24
  • @niels: no idea about it, but maybe [this](http://stackoverflow.com/questions/22463027/recursive-tricks-with-regexp-in-matlab) helps. – georg May 25 '14 at 20:39
  • Re: parsers. If you're parsing Matlab code you might look into the undocumented [`mlintmex`](http://undocumentedmatlab.com/blog/parsing-mlint-code-analyzer-output) and [`mtree`](http://undocumentedmatlab.com/blog/function-definition-meta-info). – horchler May 26 '14 at 03:31
  • FYI, just added explanation for the regex pattern. – zx81 May 26 '14 at 09:17
  • @Joeytje50 @OliCharlesworth `this is not possible with Regex` But I think my answer does it. :) – zx81 May 26 '14 at 09:24
  • @Vlad `Regular expression matching cannot be used to count an arbitrary number of matching expressions` But the regex in my answer seems to do it. :) – zx81 May 26 '14 at 09:25

1 Answer

Niels, this is an interesting and tricky question because you are looking for overlapping matches. Even with recursion, the task is not trivial.

You asked about any idea how to achieve this with regex, so it sounds like even if this is not available in matlab, you would be interested in seeing an answer that shows you how to do it in regex.

This makes sense to me because tools often change the regex libraries they use. For instance Notepad++, which used to have crippled regex, switched to PCRE in version 6. (As it happens, PCRE would work with this solution.)

In Perl and PCRE, you can use this short regex:

(?=(\b\w+\((?:\d+|(?1))\)))

This will match:

cos(t(2))
t(2)
t(51)

For instance, in PHP, you could use this code (see the results at the bottom of the online demo).

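// Single quotes keep the backslashes in the pattern literal.
$regex = '~(?=(\b\w+\((?:\d+|(?1))\)))~';
$string = "cos(t(2))+t(51)";
preg_match_all($regex, $string, $matches);
// The overall matches are empty (lookahead only); Group 1 holds the captures.
print_r($matches[1]);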

How does it work?

  1. To allow overlapping matches, we use a lookahead. That way, after matching cos(t(2)), the engine will position itself NOT after cos(t(2)), but before the o in cos. At that position the \b fails (c and o are both word characters), so os(t(2)) is not reported as a match.
  2. In fact the engine does not actually match cos(t(2)) but merely captures it to Group 1. What it matches is the assertion that at this position in the string, looking ahead, we can see the expression we are looking for. After matching this assertion, it tries to match it again starting from the next position in the string.
  3. The expression in the lookahead (which describes what we're looking for) is fairly simple: in (\b\w+\((?:\d+|(?1))\)), after the \d+, the alternation | allows us to repeat subroutine number one with (?1), which is to say, the whole expression we are currently within. So we don't recurse the entire regex (which includes a lookahead), but a subexpression thereof.
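
If your engine has no recursion support, the stack approach suggested in the comments on the question is an alternative. Here is a minimal sketch in Python, not a tested drop-in for Matlab. The function name find_calls is made up for illustration, and unlike the regex above it accepts any balanced contents between the parentheses (not only digits or a nested call), which is closer to what the question describes.

```python
import re

def find_calls(text):
    """Return every name(...) substring of text whose parentheses balance."""
    results = []
    stack = []  # for each open "(", the start index of the name before it (or None)
    for i, ch in enumerate(text):
        if ch == "(":
            # Look for a run of word characters ending right before this "(".
            m = re.search(r"\w+\Z", text[:i])
            stack.append(m.start() if m else None)
        elif ch == ")" and stack:
            start = stack.pop()
            if start is not None:
                results.append((start, text[start:i + 1]))
    # Sort by start index so the order matches a left-to-right regex scan.
    return [s for _, s in sorted(results)]

print(find_calls("cos(t(2))+t(51)"))  # ['cos(t(2))', 't(2)', 't(51)']
```

Each opening parenthesis pushes the position of the name in front of it, and each closing parenthesis pops the most recent one, so nested calls pair up correctly without any counting in the pattern itself. A parenthesis with no name before it, as in (1+2), pushes None and is skipped.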
zx81
  • You know what, I had not *ever* seen recursion using just regex before. +1 for teaching me something new. – Joeytje50 May 26 '14 at 10:11
  • @Joeytje50 Yes, recursion in regex doesn't come up all that often, but when you need it it's pretty neat. If you're interested in regex tricks, this one applies to many situations and is one of my favorites: [Match (or replace) a pattern except in situations s1, s2, s3 etc](http://stackoverflow.com/questions/23589174/match-or-replace-a-pattern-except-in-situations-s1-s2-s3-etc/23589204#) – zx81 May 26 '14 at 10:47
  • That looks quite useful. That sentence of "Sadly, the technique is not well known" is probably something every programmer has experienced one way or the other (for me it's the file input ` – Joeytje50 May 26 '14 at 11:47
  • @Joeytje50 `That trick could be really useful there.` Yes, definitely. `something every programmer has experienced one way or the other` Ha, too right. :) – zx81 May 26 '14 at 11:49
  • @zx81. That is interesting however, I wonder if this is truly a "regular expression" in the sense that it isn't matching a regular language. Props for the answer, I would really like to know how it works under the hood. – SystemFun May 26 '14 at 18:20