The problem: Match an arbitrarily nested group of brackets in a flavour of regex such as Java's java.util.regex that supports neither recursion nor balancing groups. I.e., match the three outer groups in:

(F(i(r(s)t))) ((S)(e)((c)(o))(n)d) (((((((Third)))))))

This exercise is purely academic, since we all know that regular expressions are not supposed to be used to match these things, just as Q-tips are not supposed to be used to clean ears.

Stack Overflow encourages self-answered questions, so I decided to create this post to share something I recently discovered.

Peter Mortensen
jaytea
  • Why do you ask this question if you know Java doesn't support recursion? This statement `since we all know that regular expressions are not supposed to be used to match these things` is utter baloney. –  Nov 10 '17 at 16:43
  • Here is the regex `\((?:[^()]++|(?R))*\)` –  Nov 10 '17 at 16:50
  • @sln, I asked the question and answered it - please scroll down. That statement was sarcasm used to pre-emptively deflect those who would, as they always do, reply saying that regex is not the correct tool for this task. Unfortunately, my post here failed to prompt positive, insightful discussion, as demonstrated by your and other responses. – jaytea Nov 10 '17 at 16:58
  • I'm going over your regex right now. It works in Perl. I'm trying to do an equivalent in boost regex but it doesn't do undefined backreference before the fact. It could be because of this it is using a pseudo stack, and I'm gonna find out. So far I do this `(?(DEFINE)(?.*)(?.*\)(?!.*\k).*))(?=\()(?:(?=.*?\((?!.*?\k)(?&G1))(?=.*?\)(?!.*?\k)(?&G2)).)+?.*?(?=\k)[^(]*(?=\k$)` but it does not find any. –  Nov 10 '17 at 17:33
  • Yes, unfortunately I think it's as you said: boost doesn't support forward references. Did you find out anything? – jaytea Nov 11 '17 at 08:36
  • My bad. Of course, groups defined in and called by functions don't retain their captures for usage elsewhere. As to `boost doesn't support forward references`, I just changed my version so it does. You can try your regex's using it www.regexformat.com. It's a heavily modified boost engine which is now almost _Perl_. After some testing, it seems your regex exhibits strange behavior when the otherwise normal matches are put on different lines. Example `(F(i(r(s)t))) ((S)(e)((c)(o))(n)d) (((((((Third)))))))\n` matches the entire line. –  Nov 17 '17 at 19:06
  • _(con't)_ where `(F(i(r(s)t)))\n ((S)(e)((c)(o))(n)d) (((((((Third)))))))` matches `((S)(e)((c)(o))(n)d)` and `(((((((Third)))))))` independently but matches `((S)(e)((c)(o))(n)d) (((((((Third)))))))` if there are any newlines after it, and always the last line. If you change to use _Dot-All_ it matches any independently and works as it should. To me, this behavior is typical of unbounded assertions and is proof of flawed recursive simulation. Or, maybe not, but something you should check. –  Nov 17 '17 at 19:18
  • @sln, I would hesitate to call this a "simulation" of regex recursion, since the extent to which this and similar tricks can be used to replicate recursive patterns is not fully clear to me. While this post shows it can be used to replicate it for the purpose of matching balanced structures (which is its most common application), I _highly_ doubt it can be used to replicate it in general and didn't mean to imply that in any of my writing. – jaytea Nov 18 '17 at 07:48
  • As for the behaviour you're observing with new lines, it's to be expected. We need all of those `.*` parts to potentially match line terminators in order for the method to work correctly with multiple lines. I could force this by changing every `.` to `[\s\S]` in the example (and `\2$` to `\2\z` to supersede the 'm' modifier), but I chose to keep it as simple as possible since it is already quite difficult to understand. – jaytea Nov 18 '17 at 07:54
  • Related: [the question for which an answer](https://stackoverflow.com/questions/1732348/regex-match-open-tags-except-xhtml-self-contained-tags/1732454#1732454) starts with *"You can't parse [X]HTML with regex."*. – Peter Mortensen Jul 22 '20 at 12:50

2 Answers

Indeed! It's possible using forward references:

(?=\()(?:(?=.*?\((?!.*?\1)(.*\)(?!.*\2).*))(?=.*?\)(?!.*?\2)(.*)).)+?.*?(?=\1)[^(]*(?=\2$)

Proof

Et voilà; there it is. That right there matches a full group of nested parentheses from start to end. Two substrings per match are necessarily captured and saved; these are useless to you. Just focus on the results of the main match.

No, there is no limit on depth. No, there are no recursive constructs hidden in there. Just plain ol' lookarounds, with a splash of forward referencing. If your flavour does not support forward references (I'm looking at you, JavaScript), then I'm sorry. I really am. I wish I could help you, but I'm not a freakin' miracle worker.
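
For anyone who wants to try this from Java, a minimal sketch might look like the following. Only the pattern comes from this answer (with the backslashes doubled for a Java string literal); the class name `OuterGroups` and the helper `findGroups` are made up for illustration. It assumes a single-line subject, since `.` does not match line terminators unless DOTALL is enabled.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper class; not part of the original answer.
class OuterGroups {
    // The expression from the answer, unchanged apart from Java string escaping.
    static final Pattern OUTER = Pattern.compile(
        "(?=\\()(?:(?=.*?\\((?!.*?\\1)(.*\\)(?!.*\\2).*))(?=.*?\\)(?!.*?\\2)(.*)).)+?.*?(?=\\1)[^(]*(?=\\2$)");

    // Collects the text of every overall match; groups 1 and 2 are ignored on purpose.
    static List<String> findGroups(String input) {
        List<String> groups = new ArrayList<>();
        Matcher m = OUTER.matcher(input);
        while (m.find()) {
            groups.add(m.group());
        }
        return groups;
    }

    public static void main(String[] args) {
        String subject = "(F(i(r(s)t))) ((S)(e)((c)(o))(n)d) (((((((Third)))))))";
        for (String group : findGroups(subject)) {
            System.out.println(group);
        }
    }
}
```

On the sample string from the question, this should print the three outer groups, one per line.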

That's great and all, but I want to match inner groups too!

OK, here's the deal. The reason we were able to match those outer groups is because they are non-overlapping. As soon as the matches we desire begin to overlap, we must tweak our strategy somewhat. We can still inspect the subject for correctly-balanced groups of parentheses. However, instead of outright matching them, we need to save them with a capturing group like so:

(?=\()(?=((?:(?=.*?\((?!.*?\2)(.*\)(?!.*\3).*))(?=.*?\)(?!.*?\3)(.*)).)+?.*?(?=\2)[^(]*(?=\3$)))

Exactly the same as the previous expression, except I've wrapped the bulk of it in a lookahead to avoid consuming characters, added a capturing group, and tweaked the backreference indices so they play nice with their new friend. Now the expression matches at the position just before the next parenthetical group, and the substring of interest is saved as \1.
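
A rough Java sketch of how the capturing variant might be used is shown here. The pattern is the one above with Java string escaping; `InnerGroups` and `findAllGroups` are hypothetical names chosen for the example. Because every match is zero-width, the interesting text is read from group 1 rather than from the match itself.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper class; not part of the original answer.
class InnerGroups {
    // The lookahead-wrapped expression; the balanced group is saved as group 1.
    static final Pattern GROUP_AT = Pattern.compile(
        "(?=\\()(?=((?:(?=.*?\\((?!.*?\\2)(.*\\)(?!.*\\3).*))(?=.*?\\)(?!.*?\\3)(.*)).)+?.*?(?=\\2)[^(]*(?=\\3$)))");

    // Each match is empty and sits just before a '('; group 1 holds the group starting there.
    static List<String> findAllGroups(String input) {
        List<String> groups = new ArrayList<>();
        Matcher m = GROUP_AT.matcher(input);
        while (m.find()) {
            groups.add(m.group(1));
        }
        return groups;
    }

    public static void main(String[] args) {
        for (String group : findAllGroups("(F(i(r(s)t)))")) {
            System.out.println(group);
        }
    }
}
```

For the first group of the sample input, this should list the outer group followed by each nested group, from outermost to innermost.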

So... how the hell does this actually work?

I'm glad you asked. The general method is quite simple: iterate through characters one at a time while simultaneously matching the next occurrences of '(' and ')', capturing the rest of the string in each case so as to establish positions from which to resume searching in the next iteration. Let me break it down piece by piece:

[Image: breakdown of the regular expression]
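
The general method described above can also be paraphrased in plain Java, without any regex, to make the bookkeeping easier to follow. This is only an illustrative sketch of the pairing idea, not a translation of the expression: `PairingSketch` and `endOfGroup` are invented names, and the two integer indices stand in for the positions that the regex remembers through its two captured "rest of the string" substrings.

```java
// Hypothetical illustration of the pairing loop; not part of the original answer.
class PairingSketch {
    // Returns the index just past the balanced group starting at 'start', or -1 if none is found.
    static int endOfGroup(String s, int start) {
        if (start < 0 || start >= s.length() || s.charAt(start) != '(') {
            return -1; // plays the role of the leading (?=\() check
        }
        int open = start - 1;  // index of the '(' picked up in the latest iteration
        int close = start - 1; // index of the ')' picked up in the latest iteration
        while (true) {
            // One iteration: advance to the next '(' and, independently, to the next ')'.
            open = s.indexOf('(', open + 1);
            close = s.indexOf(')', close + 1);
            if (open < 0 || close < 0) {
                return -1; // ran out of brackets before the group closed
            }
            // Tail check, in the spirit of [^(]*: done once no '(' lies between the two.
            int nextOpen = s.indexOf('(', open + 1);
            if (close > open && (nextOpen < 0 || nextOpen > close)) {
                return close + 1;
            }
        }
    }

    public static void main(String[] args) {
        String subject = "(F(i(r(s)t))) ((S)(e)((c)(o))(n)d) (((((((Third)))))))";
        int end = endOfGroup(subject, 0);
        System.out.println(subject.substring(0, end));
    }
}
```

The loop stops at the first iteration where the most recent '(' and the most recent ')' have no further '(' between them, which is the point at which the same number of opening and closing brackets have been seen.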

Conclusion

So, there you have it. A way to match balanced nested structures using forward references coupled with standard (extended) regular expression features - no recursion or balancing groups. It's not efficient, and it certainly isn't pretty, but it is possible. And it's never been done before. That, to me, is quite exciting.

I know a lot of you use regular expressions to accomplish and help other users accomplish simpler and more practical tasks, but if there is anyone out there who shares my excitement for pushing the limits of possibility with regular expressions then I'd love to hear from you. If there is interest, I have other similar material to post.

Peter Mortensen
jaytea
  • `(But it is oh so fragile)(Why?)(Because: "a) you could have quoted brackets! :)")("That's not fair :(")` – AJNeufeld Nov 07 '17 at 16:10
  • @AJNeufeld Haha, this is just for you :) https://regex101.com/r/Dfao1h/1 – jaytea Nov 07 '17 at 18:35
  • @jaytea Close, but no cigar! The 4th match comes out as `(")`, instead of `("That's not fair :(")` – AJNeufeld Nov 07 '17 at 20:00
  • @AJNeufeld Small oversight, here: https://regex101.com/r/Dfao1h/4 . I'm a little saddened that people seem to have missed the point of this post :( Don't tell me there's a "b)" too? haha – jaytea Nov 08 '17 at 04:09
  • Of course there is a "b)"! `(Single quoted characters, like: '(', '"', and ')')`, not to mention `("c) Escaped \"quotes\" within quotes (\")")`. Then we have `("d) actual backslashes, which need escaping, like \\")`, so you can't just look for `\"`, because the `\ ` before the `"` could itself be escaped! I don't think people have missed the point that the post was "academical". I think the issue is this RegEx is unmaintainable: 90 characters of magic codes for the simple version? And it keeps getting longer with more special cases, like "e) parse a regex with your regex": `(\([^)]+\))` – AJNeufeld Nov 08 '17 at 19:02
  • @AJNeufeld I humoured you before, but now I think you're beating a dead horse. Handling character escapes and quoted strings is a very basic extension to my example and has been done many _many_ times before. This is why I say you are missing the point of this post. The point was to introduce something that hasn't been done before. Not to introduce something and claim it will work for every conceivable use case. Many others before me have posted similar demonstrations and have received much more positive feedback. That's why I figured my post would fit in here. – jaytea Nov 09 '17 at 04:05
  • @jaytea Without the `$` sign inside the final validation, it would work multiline too... `(?=\()(?:(?=.*?\((?!.*?\1)(.*\)(?!.*\2).*))(?=.*?\)(?!.*?\2)(.*)).)+?.*?(?=\1)[^(]*(?=\2)` https://regex101.com/r/rKytow/1 – Piterden Nov 12 '17 at 08:18
  • @Denis Wrong, I'm afraid. That '$' is a necessary part of the validation. If `\2` = "" (which happens when the last ')' matched is at the end of the string), `(?=\2)` matches at every position. It will thus match "(" in "(()", which is not desirable. Please see: https://regex101.com/r/rKytow/2 My regex already has the capability of matching over multiple lines; you just need to use the 's' modifier to force '.' to match line terminators. Please see: https://regex101.com/r/rKytow/3 – jaytea Nov 12 '17 at 08:31
  • @Piterden I promise you don't need to change my regex in any way to get it to match over multiple lines. Your amendment only serves to break it in this case: https://regex101.com/r/rKytow/5 As I said: all you need is the 's' (DOTALL) modifier. – jaytea Nov 12 '17 at 09:43
  • Wow! That's a very interesting technique @jaytea, hats off. And yes, some of us actually enjoy reading high quality posts like this one. – Mariano Nov 18 '17 at 05:24
  • @Mariano, thank you so much for your kind words of encouragement! I'm very happy that this post has gotten more positive attention since the initial blip; it makes me keen to contribute further material :) I must also thank bobble-bubble for starting a bounty, and also whomever it is that appears to be going through my old answers and voting them up :D – jaytea Nov 18 '17 at 08:04
  • I'm deeply impressed - hats off! :-) A suggestion to improve this: you could turn this into a regex matching something containing balanced parentheses by adding [^()]* at both sides. And using named capturing groups allows easier embedding of this into one's own regexes (where you might want to add stuff and groups outside, e.g. to parse a function call), and it possibly becomes easier to understand. – Hans-Peter Störr May 10 '19 at 13:54
  • Hi @jaytea excellent post...how would one evolve the regex so to escape brackets `(` and `)` via escape sequence `\(` and `\)`? – bhreinb Dec 09 '20 at 14:10
  • Not everyone can see images. It would be really great if the explanation could become text instead. – Scratte Mar 03 '21 at 21:27

Brief

Input Corrections

First of all, your input is incorrect, as there's an extra closing parenthesis (marked below):

(F(i(r(s)t))) ((S)(e)((c)(o))n)d) (((((((Third)))))))
                                ^

Making appropriate modifications to either include or exclude the additional parenthesis, one might end up with one of the following strings:

Extra parenthesis removed

(F(i(r(s)t))) ((S)(e)((c)(o))n)d (((((((Third)))))))
                                ^

Additional parenthesis added to match extra closing parenthesis

((F(i(r(s)t))) ((S)(e)((c)(o))n)d) (((((((Third)))))))
^

Regex Capabilities

Second of all, this is only truly possible in regex flavours that support recursion, since any other method will not properly match opening and closing brackets (the OP's solution, for instance, matches the extra parenthesis from the incorrect input noted above).

This means that for regex flavours that do not currently support recursion (Java, Python, JavaScript, etc.), recursion (or attempts at mimicking recursion) in regular expressions is not possible.


Input

Considering the original input is actually invalid, we'll use the following inputs to test against.

(F(i(r(s)t))) ((S)(e)((c)(o))n)d) (((((((Third)))))))
(F(i(r(s)t))) ((S)(e)((c)(o))n)d (((((((Third)))))))
((F(i(r(s)t))) ((S)(e)((c)(o))n)d) (((((((Third)))))))

Testing against these inputs should yield the following results:

  1. INVALID (no match)
  2. VALID (match)
  3. VALID (match)
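
As a sanity check on these expected verdicts, independent of any regex flavour, the three lines can be run through a plain depth counter. This is a different and much simpler technique than the patterns in this answer, and it does not reproduce every detail of what they accept; `BalanceCheck` and `isBalanced` are names invented for this sketch.

```java
// Hypothetical sanity check; a depth counter, not the regex approach from this answer.
class BalanceCheck {
    // True when every ')' closes an earlier '(' and nothing is left open at the end.
    static boolean isBalanced(String line) {
        int depth = 0;
        for (int i = 0; i < line.length(); i++) {
            char c = line.charAt(i);
            if (c == '(') {
                depth++;
            } else if (c == ')') {
                depth--;
                if (depth < 0) {
                    return false; // a closing bracket with nothing to close
                }
            }
        }
        return depth == 0;
    }

    public static void main(String[] args) {
        String[] inputs = {
            "(F(i(r(s)t))) ((S)(e)((c)(o))n)d) (((((((Third)))))))",
            "(F(i(r(s)t))) ((S)(e)((c)(o))n)d (((((((Third)))))))",
            "((F(i(r(s)t))) ((S)(e)((c)(o))n)d) (((((((Third)))))))"
        };
        for (String input : inputs) {
            System.out.println(isBalanced(input) ? "VALID" : "INVALID");
        }
    }
}
```

For the three test lines, the counter should agree with the list above: the first is rejected at the stray closing parenthesis and the other two are accepted.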

Code

There are multiple ways of matching nested groups. The solutions provided below all depend on regex flavours that include recursion capabilities (e.g. PCRE).

See regex in use here

Using DEFINE block

(?(DEFINE)
  (?<value>[^()\r\n]+)
  (?<groupVal>(?&group)|(?&value))
  (?<group>(?&value)*\((?&groupVal)\)(?&groupVal)*)
)
^(?&group)$

Note: This regex uses the flags gmx

Without DEFINE block

See regex in use here

^(?<group>
  (?<value>[^()\r\n]+)*
  \((?<groupVal>(?&group)|(?&value))\)
  (?&groupVal)*
)$

Note: This regex uses the flags gmx

Without x modifier (one-liner)

See regex in use here

^(?<group>(?<value>[^()\r\n]+)*\((?<groupVal>(?&group)|(?&value))\)(?&groupVal)*)$

Without named groups and references

See regex in use here

^(([^()\r\n]+)*\(((?1)|(?2))\)(?3)*)$

Note: This is the shortest possible method that I could come up with.


Explanation

I'll explain the last regex as it's a simplified and minimal example of all the other regular expressions above it.

  • ^ Assert position at the start of the line
  • (([^()\r\n]+)*\(((?1)|(?2))\)(?3)*) Capture the following into capture group 1
    • ([^()\r\n]+)* Capture the following into capture group 2 any number of times
      • [^()\r\n]+ Match any character not present in the set ()\r\n one or more times
    • \( Match a left/opening parenthesis character ( literally
    • ((?1)|(?2)) Capture either of the following into capture group 3
      • (?1) Recurse the first subpattern (1)
      • (?2) Recurse the second subpattern (2)
    • \) Match a right/closing parenthesis character ) literally
    • (?3)* Recurse the third subpattern (3) any number of times
  • $ Assert position at the end of the line
ctwheels
  • thanks for the post and pointing out the error in the question! I didn't downvote you, but I suspect whoever did probably did so because your suggestion uses recursion, which is not novel and not the point of this discussion. Incidentally, the solution with forward references is OK, it's just the input that was not OK. The expression still correctly matches complete, properly balanced groups of parentheses (leaving out the additional ')'), as it's supposed to. To validate a line for properly balanced parenthetical groups, you can use this: https://regex101.com/r/Dfao1h/2 – jaytea Nov 07 '17 at 18:44
  • yes, that was hastily made I'm afraid! This is corrected: https://regex101.com/r/Dfao1h/3 - I promise it's possible, and I urge you to go through the breakdown of the expression in my post and understand the method. I'm sure you'll understand how and why it works. The expression in the answer matches a full group, and the one I supplied just now validates (which is admittedly trickier). – jaytea Nov 07 '17 at 19:05
  • @jaytea the last one does work. You should update your answer to include that regex instead of the one your answer currently includes. – ctwheels Nov 07 '17 at 19:30
  • I just wanted to demonstrate the simplest variation of this concept, i.e. "match a group" rather than "validate a group" or "match a group with possibly quoted contents" (like the other comment), etc. – jaytea Nov 08 '17 at 04:13