I found it in the following regex:

\[(?:[^][]|(?R))*\]

It matches square brackets (with their content) together with nested square brackets.

Emanuil Rusev

1 Answer

[^][] is a character class that means all characters except [ and ].

You don't need to escape the special characters [ and ] here, because the pattern is not ambiguous for PCRE, the regex engine behind PHP's preg_ functions.

Since an empty class such as [^] is invalid in PCRE, the only way to parse the pattern is to treat the first ] as a literal inside the character class, which is closed later. The same goes for the [ that follows: it cannot open a new character class inside a character class (the one exception being a POSIX character class such as [:alnum:]). The last ] is then unambiguous: it ends the character class. A [ outside a character class, however, must be escaped, since it would otherwise be parsed as the beginning of a character class.

In the same way, you can write []] or [[] or [^[] without escaping the [ or ] in the character class.
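As a quick check of the rule above, here is a minimal sketch in Python, one of the flavours listed further down that accepts this syntax. The helper name strip_brackets is made up for illustration. The [[] form is left out on purpose, since recent Python versions may emit a FutureWarning for a class that starts with a literal [.

```python
import re

# Same class as in the answer: any single character except "[" and "]".
NOT_A_BRACKET = re.compile(r"[^][]")

# A class holding only "]"; it is literal because it comes first.
ONLY_CLOSING = re.compile(r"[]]")

# Any single character except "[".
NOT_OPENING = re.compile(r"[^[]")


def strip_brackets(text: str) -> str:
    """Keep only the characters matched by [^][] (hypothetical helper)."""
    return "".join(NOT_A_BRACKET.findall(text))


print(strip_brackets("a[b[c]]d"))     # abcd
print(ONLY_CLOSING.findall("[x]]"))   # [']', ']']
print(NOT_OPENING.findall("[x]"))     # ['x', ']']
```

None of the three patterns needs a backslash; the position of ] right after the opening [ (or after the ^) is what makes it a literal.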

Note: since PHP 7.3, you can use the inline xx modifier, which makes blank characters ignored even inside character classes. This lets you write these classes in a less ambiguous way, like this: (?xx) [^ ][ ] [ ] ] [ [ ] [^ [ ].

You can use this syntax with several regex flavours: PCRE (PHP, R), Perl, Python, Java, .NET, Go, awk, Tcl (if you delimit your pattern with curly brackets, thanks Donal Fellows), ...

But not with: Ruby, JavaScript (except for IE < 9), ...

As m.buettner noted, [^]] is unambiguous only because ] is the first character. [^a]] is read as any character that is not an a, followed by a literal ]. To exclude both a and ], you must write [^a\]] or [^]a].
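The difference is easy to see on a small input. This is only an illustrative sketch in Python, whose re module follows the same "] first is literal" rule; the sample string is arbitrary.

```python
import re

sample = "x]a]"

# "]" is NOT first: the class is [^a], followed by a literal "]".
not_a_then_bracket = re.findall(r"[^a]]", sample)

# "]" is first (after the negation): one class excluding both "]" and "a".
neither_first = re.findall(r"[^]a]", sample)

# Escaped form: the same class as above, written with a backslash.
neither_escaped = re.findall(r"[^a\]]", sample)

print(not_a_then_bracket)  # ['x]']
print(neither_first)       # ['x']
print(neither_escaped)     # ['x']
```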

In the particular case of JavaScript, the specification allows [] as a regex token that never matches (in other words, [] always fails) and [^] as a regex that matches any character. So [^]] is read as any character followed by a ]. Actual implementations vary, but modern browsers generally stick to the definition in the specification.

Pattern details:

\[          # literal [
(?:         # open a non capturing group
    [^][]   # a character that is not a ] or a [
  |         # OR
    (?R)    # the whole pattern (here is the recursion)
)*          # repeat zero or more times
\]          # a literal ]
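
To make the recursion step concrete, here is a rough Python emulation of what the pattern matches. Python's built-in re module has no (?R), so this is a hand-written recursive function and not the regex engine itself; the names match_balanced and find_balanced are invented for this sketch.

```python
from typing import Optional


def match_balanced(text: str, start: int = 0) -> Optional[int]:
    """Return the index just past a balanced [...] group that begins
    at `start`, or None if there is no such group at that position."""
    if start >= len(text) or text[start] != "[":
        return None                      # \[  must match first
    pos = start + 1
    while pos < len(text):
        ch = text[pos]
        if ch == "]":                    # \]  closes this level
            return pos + 1
        if ch == "[":                    # (?R): recurse on a nested group
            inner_end = match_balanced(text, pos)
            if inner_end is None:
                return None
            pos = inner_end
        else:                            # [^][]: any non-bracket character
            pos += 1
    return None                          # ran out of input before "]"


def find_balanced(text: str) -> Optional[str]:
    """Return the first balanced group found scanning left to right."""
    for i, ch in enumerate(text):
        if ch == "[":
            end = match_balanced(text, i)
            if end is not None:
                return text[i:end]
    return None


print(find_balanced("x[a[b]c]y"))  # [a[b]c]
print(find_balanced("[a[b]"))      # [b]
```

Each nested [ triggers a fresh call, just as (?R) re-enters the whole pattern, and an unclosed outer bracket makes the search fall through to the next candidate position.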

In your pattern example, you don't need to escape the last ].

You can do the same with this slightly optimized pattern, which is also more useful because it is reusable as a subpattern (thanks to the (?-1)): (\[(?:[^][]+|(?-1))*+])

(                     # open the capturing group
    \[                # a literal [
        (?:           # open a non-capturing group
            [^][]+    # all characters but ] or [ one or more times
          |           # OR
            (?-1)     # the last opened capturing group (recursion)
                      # (the capture group where you are)
        )*+           # repeat the group zero or more times (possessive)
    ]                 # literal ] (no need to escape)
)                     # close the capturing group

or better: (\[[^][]*(?:(?-1)[^][]*)*+]) that avoids the cost of an alternation.

Casimir et Hippolyte
  • 29
    ... but you *should* escape those special characters since it confuses the hell out of the person maintaining your code. :) Regex authors tend to like "tricky" code (I have been guilty of this) but tricky code is hard-to-understand code. – cdhowie Jul 24 '13 at 21:22
  • 3
    Note that the important point here is that `]` occurs as the first character (after the negation), because empty classes are disallowed. – Martin Ender Jul 24 '13 at 21:23
  • 11
    @cdhowie: Especially if you also work in JavaScript where `[^]` is a valid regex (meaning "any character") – Tim Pietzcker Jul 24 '13 at 21:23
  • @cdhowie this is actually the only example in which I agree with this... in all other cases, I'd try to avoid as many escapes as possible by clever character placement - because escapes make regex a lot harder to read than they already are. – Martin Ender Jul 24 '13 at 21:24
  • 1
    @TimPietzcker and so is `[]`, which always fails. – Martin Ender Jul 24 '13 at 21:24
  • @m.buettner I'll buy that. Especially when you have to specify the regex as a string, in which case most languages will make you use two backslashes, one to escape the other in the string literal. (C#'s `@""` literal notation is rather nice for this purpose.) – cdhowie Jul 24 '13 at 21:25
  • 3
    @TimPietzcker Except in IEs < 9, where `[^]` acts like PHP does in this answer – Izkata Jul 24 '13 at 21:26
  • @m.buettner: Yes, but in MSIE quirks mode, it does behave like Izkata notes. – Tim Pietzcker Jul 24 '13 at 21:27
  • @m.buettner We discovered last week that older IEs do not support that when some javascript stopped working =) – Izkata Jul 24 '13 at 21:27
  • @Izkata yeah forget what I said... before you edited the comment, I read it the other way round... – Martin Ender Jul 24 '13 at 21:28
  • @m.buettner Too many people jumped on it at once. I actually deleted my first comment and wrote a new one that was phrased differently, when like 5+ new comments appeared above mine after submitting. (@_@) – Izkata Jul 24 '13 at 21:32
  • @cdhowie: the person that is not able to see that, don't have to maintain my code. – Casimir et Hippolyte Jul 24 '13 at 21:34
  • 2
    @CasimiretHippolyte It's not about whether they are *able* to parse the regex, but about whether or not you have made the intended meaning of your regex abundantly clear. Most languages don't consider whitespace significant, but we use it liberally by putting statements on their own line, for example. This is not for the benefit of the compiler, rather for the benefit of future maintainers. This is another such case where the compiler couldn't care less, but the next guy to maintain your code will appreciate not having to stop and try to guess what your intent was. – cdhowie Jul 24 '13 at 21:39
  • @CasimiretHippolyte you got Izkata's comment the wrong way round... JavaScript does not support this construct *except* in IE<9 – Martin Ender Jul 24 '13 at 22:04
  • 2
    @CasimiretHippolyte, I am gonna have to agree with cdhowie on this one... not using escape characters in this situation is horrible. Unless you are writing purely disposable code for your own personal amusement, someone *will have to maintain your code*. Don't be a jerk to that person just because you think they will not exist. – trognanders Jul 25 '13 at 01:50
  • @CasimiretHippolyte: `[]` does *not* match the empty string. It always fails to match. In essence, it means "match one character that is a member of this empty collection of characters" - and no such character can logically exist. – Tim Pietzcker Jul 25 '13 at 13:11
  • @TimPietzcker: it was the sense of my sentence :). Hopefully you have edited it to a more understandable form. – Casimir et Hippolyte Jul 25 '13 at 16:05
  • My preferred way to express this one is: `[[\]]` (looks the cleanest to me) – ridgerunner Aug 18 '13 at 19:37
  • This answer has been added to the [Stack Overflow Regular Expression FAQ](http://stackoverflow.com/a/22944075/2736496), under "Control Verbs and Recursion". – aliteralmind Apr 10 '14 at 01:11
  • Sorry, we shouldn't have declined that flag. That was a mistake, and I've removed the community wiki status. – Brad Larson Mar 08 '15 at 19:39