9

I am trying to write a regexp which matches nested parentheses, e.g.:

"(((text(text))))(text()()text)(casual(characters(#$%^^&&#^%#@!&**&#^*!@#^**_)))"

A string like this should be matched, cause all the nested parentheses are closed, instead:

"(((text)))(text)(casualChars*#(!&#*(!))"

Should not, or better, should match at least the first "(((text)))(text)" part.

Actually, my regexp is:

 $regex = '/( (  (\() ([^[]*?)  (?R)?  (\))  ){0,}) /x';

But it does not work properly as I am expecting. How to fix that? Where am I wrong? Thanks!

tonix
  • 5,862
  • 10
  • 61
  • 119
  • 2
    I wrote a parser for SQL that needed to recursively do this. It is far easier to have recursive functions with a regex, than try to do this recursively with regex alone. – EdgeCase Dec 13 '13 at 14:52
  • You're barking up the wrong tree, a purely regex solution will probably be overly complicated and difficult to maintain. You'd be better recursively parsing the string. – GordonM Dec 13 '13 at 14:52
  • Don't... Ok, in theory it can be done, but when you manage to do it, it will probably look like gliberish. Oh, look we found a bug in the regex! Ern... how do you fix that? Oh, we need to add support for brakets too! Ern... how do you add that? I tell you, you better use a more human readable parser. The fact that you are asking this, shows that you will probably be UNABLE to maintain it anyway. – Theraot Dec 13 '13 at 14:53
  • Thank you for your advices, but I still want to do that anyway, could you help me? Why isn't my regexp work as I am expecting? – tonix Dec 13 '13 at 15:03
  • @user3019105 What do you want to do with the matches exactly? (are you just validating, do you want to replace the content on the parenthesis, or just run a callback on each one) Also, do you want only the deepest parenthesis or you want them all? – Theraot Dec 13 '13 at 15:14
  • I am trying to trim off the comments of an email address in respecting of the RFC 5321/5322, which describe the standard of an email address. I want to catch in a $matches[1] all the couples of nested parentheses, but I want the regexp to match only until the **last** ")" correctly closed bracket, so if there are other mismatched parentheses ahead, the regexp stop matching. – tonix Dec 13 '13 at 15:29

3 Answers3

12

This pattern works:

$pattern = '~ \( (?: [^()]+ | (?R) )*+ \) ~x';

The content inside parenthesis is simply describe:

"all that is not parenthesis OR recursion (= other parenthesis)" x 0 or more times

If you want to catch all substrings inside parenthesis, you must put this pattern inside a lookahead to obtain all overlapping results:

$pattern = '~(?= ( \( (?: [^()]+ | (?1) )*+ \) ) )~x';
preg_match_all($pattern, $subject, $matches);
print_r($matches[1]);

Note that I have added a capturing group and I have replaced (?R) by (?1):

(?R) -> refers to the whole pattern (You can write (?0) too)
(?1) -> refers to the first capturing group

What is this lookahead trick?

A subpattern inside a lookahead (or a lookbehind) doesn't match anything, it's only an assertion (a test). Thus, it allows to check the same substring several times.

If you display the whole pattern results (print_r($matches[0]);), you will see that all results are empty strings. The only way to obtain the substrings found by the subpattern inside the lookahead is to enclose the subpattern in a capturing group.

Note: the recursive subpattern can be improved like this:

\( [^()]*+ (?: (?R) [^()]* )*+ \)
Casimir et Hippolyte
  • 83,228
  • 5
  • 85
  • 113
  • I have tried it but it does not work, and also I do not have the sub-pattern catched...Is there another way? – tonix Dec 13 '13 at 15:01
  • Thanks for the explanation. Need to play around with it in order to understand. – hek2mgl Dec 13 '13 at 15:08
  • Sorry for my ignorance,"(?1)" is the same as "\1"? Is it a backreference? And another question: what does double "++" mean after the [^()] class? – tonix Dec 13 '13 at 15:34
  • @user3019105: No it's different. `\1` refers to the matched content by the capturing group 1. `(?1)` refers to the subpattern inside the capturing group 1. You only repeat the subpattern. – Casimir et Hippolyte Dec 13 '13 at 15:50
  • @user3019105: When you put `(?1)` inside the first capturing group itself, you obtain a recursive subpattern. – Casimir et Hippolyte Dec 13 '13 at 15:56
  • Thank you! So it's the first sub-pattern inside the $matches[1] variable in this specific case, am I right? And about "++" and "*+" what are they? I see them for the first time...is there a reference or something or could you please explain them to me? Thank you! – tonix Dec 13 '13 at 15:59
  • @user3019105: `$matches[1]` contains the first capturing group results. About `++`, `*+`, `?+`, `{1,3}+`, these are possessive quantifiers. You can find more informations here: http://www.regular-expressions.info/possessive.html – Casimir et Hippolyte Dec 13 '13 at 16:05
  • So are possessive quantifiers as "Once-only subpatterns" described in the PHP Manual http://www.php.net/manual/en/regexp.reference.onlyonce.php? – tonix Dec 13 '13 at 16:57
  • Big +1 - Thanks for the lookahead trick - _Very Clever!_ (learn something new every day.) – ridgerunner Dec 13 '13 at 18:25
  • This answer has been added to the [Stack Overflow Regular Expression FAQ](http://stackoverflow.com/a/22944075/2736496), under "Control Verbs and Recursion". – aliteralmind Apr 10 '14 at 01:10
3

When I found this answer I wasn't able to figure out how to modify the pattern to work with my own delimiters which where { and }. So my approach was to make it more generic.

Here is a script to generate the regex pattern with your own variable left and right delimiters.

$delimiter_wrap  = '~';
$delimiter_left  = '{';/* put YOUR left delimiter here.  */
$delimiter_right = '}';/* put YOUR right delimiter here. */

$delimiter_left  = preg_quote( $delimiter_left,  $delimiter_wrap );
$delimiter_right = preg_quote( $delimiter_right, $delimiter_wrap );
$pattern         = $delimiter_wrap . $delimiter_left
                 . '((?:[^' . $delimiter_left . $delimiter_right . ']++|(?R))*)'
                 . $delimiter_right . $delimiter_wrap;

/* Now you can use the generated pattern. */
preg_match_all( $pattern, $subject, $matches );
Axel
  • 3,108
  • 10
  • 30
  • 49
  • 1
    Nice, you match the whole string provided that there's always a `$delimiter_right` closing an opened `$delimiter_left` – tonix Jul 14 '16 at 07:30
0

The following code uses my Parser class (it's under CC-BY 3.0), it works on UTF-8 (thanks to my UTF8 class).

The way it works is by using a recursive function to iterate over the string. It will call itself each time it finds a (. It will also detect missmatched pairs when it reaches the end of the string without finding the corresponding ).

Also, this code takes a $callback parameter you can use to process each piece it finds. The callback recieves two parameters: 1) the string, and 2) the level (0 = deepest). Whatever the callback returns will be replaced in the contents of the string (this changes are visible at callback of higher level).

Note: the code does not includes type checks.

Non-recursive part:

function ParseParenthesis(/*string*/ $string, /*function*/ $callback)
{
    //Create a new parser object
    $parser = new Parser($string);
    //Call the recursive part
    $result = ParseParenthesisFragment($parser, $callback);
    if ($result['close'])
    {
        return $result['contents'];
    }
    else
    {
        //UNEXPECTED END OF STRING
        // throw new Exception('UNEXPECTED END OF STRING');
        return false;
    }
}

Recursive part:

function ParseParenthesisFragment(/*parser*/ $parser, /*function*/ $callback)
{
    $contents = '';
    $level = 0;
    while(true)
    {
        $parenthesis = array('(', ')');
        // Jump to the first/next "(" or ")"
        $new = $parser->ConsumeUntil($parenthesis);
        $parser->Flush(); //<- Flush is just an optimization
        // Append what we got so far
        $contents .= $new;
        // Read the "(" or ")"
        $element = $parser->Consume($parenthesis);
        if ($element === '(') //If we found "("
        {
            //OPEN
            $result = ParseParenthesisFragment($parser, $callback);
            if ($result['close'])
            {
                // It was closed, all ok
                // Update the level of this iteration
                $newLevel = $result['level'] + 1;
                if ($newLevel > $level)
                {
                    $level = $newLevel;
                }
                // Call the callback
                $new = call_user_func
                (
                    $callback,
                    $result['contents'],
                    $level
                );
                // Append what we got
                $contents .= $new;
            }
            else
            {
                //UNEXPECTED END OF STRING
                // Don't call the callback for missmatched parenthesis
                // just append and return
                return array
                (
                    'close' => false,
                    'contents' => $contents.$result['contents']
                );
            }
        }
        else if ($element == ')') //If we found a ")"
        {
            //CLOSE
            return array
            (
                'close' => true,
                'contents' => $contents,
                'level' => $level
            );
        }
        else if ($result['status'] === null)
        {
            //END OF STRING
            return array
            (
                'close' => false,
                'contents' => $contents
            );
        }
    }
}
Theraot
  • 18,248
  • 4
  • 45
  • 72
  • Light years behind posting function like this – Martin Zvarík Nov 28 '18 at 16:55
  • @MartinZvarík thanks for bringing my attention to this answer, I have fixed the links. This is meant to be easier to maintain than regex. – Theraot Nov 28 '18 at 23:17
  • I also love how you mentioned that it is under Creative Common license :DD ... Bro... just erase the whole thing and we will forget it was ever posted here... 5 years later you must know better – Martin Zvarík Nov 29 '18 at 01:53