Converting PCRE recursive regex pattern to .NET balancing groups definition

Question

PCRE has a feature called recursive pattern, which can be used to match nested subgroups. For example, consider the "grammar"

Q -> \w | '[' A ';' Q* ','? Q* ']' | '<' A '>'
A -> (Q | ',')*
// to match ^A$.

It can be done in PCRE with the pattern

^((?:,|(\w|\[(?1);(?2)*,?(?2)*\]|<(?1)>))*)$

(Example test case: http://www.ideone.com/L4lHE)

Should match:

abcdefg abc,def,ghi abc,,,def ,,,,,, [abc;] [a,bc;] sss[abc;d] as[abc;d,e] [abc;d,e][fgh;j,k] <abc> [<a>b;<c,d>,<e,f>] <a,b,c> <a,bb,c> <,,,> <> <><> <>,<> a<<<<>>><a>> <<<<<>>>><><<<>>>> <z>[a;b] <z[a;b]> [[;];] [,;,] [;[;]] [<[;]>;<[;][;,<[;,]>]>]

Should not match:

<a bc> <abc<de> [a<b;c>;d,e] [a] <<<<<>>>><><<<>>>>> <<<<<>>>><><<<>>> [abc;def;] [[;],] [;,,] [abc;d,e,f] [<[;]>;<[;][;,<[;,]>]]> <z[a;b>]

There is no recursive pattern in .NET. Instead, it provides balancing groups for stack-based manipulation for matching simple nested patterns.

Is it possible to convert the above PCRE pattern into .NET Regex style?

_{(Yes I know it's better not to use regex in for this. It's just a theoretical question.)}

References

Nice question. My guess would be "no", but I'd love to see if someone comes up with a way to do it. — EMP, Jul 28 '10 at 04:51
I think Perl is winning this battle. I might give it a shot at the evening, but it's too much for work :P — Kobi, Jul 28 '10 at 05:32

Kobi · Answer 1 · 2015-08-05T06:21:54.450

The .Net alternative to recursive pattern is a stack. The challenge here is that we need to express the grammar it terms of stacks.
Here's one way of doing that:

A slightly different notation for grammars

First, we need grammar rules (like A and Q in the question).
We have one stack. The stack can only contain rules.
At each step we pop the current state from the stack, match what we need to match, and push further rules into the stack. When we're done with a state we don't push anything and get back to the previous state.

This notation is somewhere between CFG and Pushdown automaton, where we push rules to the stack.

Example:

Let's start with a simple example: aⁿbⁿ. The usual grammar for this language is:

S -> aSb | ε

We can rephrase that to fit the notation:

# Start with <push S>
<pop S> -> "a" <push B> <push S> | ε
<pop B> -> "b"

In words:

We start with S in the stack.
When we pop S from the stack we can either:
- Match nothing, or...
- match "a", but then we have to push the state B to the stack. This is a promise we will match "b". Next we also push S so we can keep matching "a"s if we want to.
When we've matched enough "a"s we start popping Bs from the stack, and match a "b" for each one.
When this is done, we've matched an even number of "a"'s and "b"s.

or, more loosely:

When we're in case S, match "a" and push B and then S, or match nothing.
When we're in case B, match "b".

In all cases, we pop the current state from the stack.

Writing the pattern in a .Net regular expression

We need to represent the different states somehow. We can't pick '1' '2' '3' or 'a' 'b' 'c', because those may not be available in our input string - we can only match what is present in the string.
One option is to number our states (In the example above, S would state number 0, and B is state 1).
For state S we can push characters to the stack. For convenience, we'll push the first characters from the start of the string. Again, we don't care what are these characters, just how many there are.

Push state

In .Net, if we want to push the first 5 characters of a string to a stack, we can write:

(?<=(?=(?<StateId>.{5}))\A.*)

This is a little convoluted:

(?<=…\A.*) is a lookbehind that goes all the way to the start of the string.
When we're at the start there's a look ahead: (?=…). This is done so we can match beyond the current position - if we are at position 2, we don't have 5 characters before us. So we're looking back and looking forward.
(?<StateId>.{5}) push 5 characters to the stack, indicating that at some point we need to match state 5.

Pop state

According to our notation, we always pop the top state from the stack. That is easy: (?<-StateId>).
But before we do that, we want to know what state that was - or how many characters it captured. More specifically, we need to check explicitly for each state, like a switch/case block. So, to check if the stack currently holds state 5:

(?<=(?=.{5}(?<=\A\k<StateId>))\A.*)

Again, (?<=…\A.*) goes all the way to the start.
Now we advance (?=.{5}…) by five characters...
And use another lookbehind, (?<=\A\k<StateId>) to check that the stack really has 5 characters.

This has an obvious drawback - when the string is too short, we cannot represent the number of large states. This problem has solutions:

The number of short words in the language is final anyway, so we can add them manually.
Use something more complicated than a single stack - we can use several stacks, each with zero or one characters, effectively bits of our state (there's an example at the end).

Result

Our pattern for aⁿbⁿ looks like this:

\A
# Push State A, Index = 0
(?<StateId>)
(?:
    (?:
        (?:
            # When In State A, Index = 0
            (?<=(?=.{0}(?<=\A\k<StateId>))\A.*)
            (?<-StateId>)
            (?:
                # Push State B, Index = 1
                (?<=(?=(?<StateId>.{1}))\A.*)
                a
                # Push State A, Index = 0
                (?<StateId>)
                |

            )
        )
        |
        (?:
            # When In State B, Index = 1
            (?<=(?=.{1}(?<=\A\k<StateId>))\A.*)
            (?<-StateId>)
            b
        )
        |\Z
    ){2}
)+
\Z
# Assert state stack is empty
(?(StateId)(?!))

Working example on Regex Storm

Notes:

Note that the quantifier around the states is (?:(?:…){2})+ - that is, (state count)×(input length). I also added an alternation for \Z. Let's not get into that, but it's a workaround for an annoying optimization in the .Net engine.
The same can be written as (?<A>a)+(?<-A>b)+(?(A)(?!)) - this is just an exercise.

To answer the question

The grammar from the question can be rewritten as:

# Start with <push A>
<pop A> -> <push A> ( @"," | <push Q> ) | ε
<pop Q> -> \w
           | "<" <push Q2Close> <push A>
           | "[" <push Q1Close> <push QStar> <push Q1Comma> <push QStar> <push Q1Semicolon> <push A>
<pop Q2Close> -> ">"
<pop QStar> -> <push QStar> <push Q> | ε 
<pop Q1Comma> -> ","?
<pop Q1Semicolon> -> ";"
<pop Q1Close> -> "]"

The pattern:

\A
# Push State A, Index = 0
(?<StateId>)
(?:
    (?:
        (?:
            # When In State A, Index = 0
            (?<=(?=.{0}(?<=\A\k<StateId>))\A.*)
            (?<-StateId>)
            (?:
                # Push State A, Index = 0
                (?<StateId>)
                (?:
                    ,
                    |
                    # Push State Q, Index = 1
                    (?<=(?=(?<StateId>.{1}))\A.*)
                )
                |

            )
        )
        |
        (?:
            # When In State Q, Index = 1
            (?<=(?=.{1}(?<=\A\k<StateId>))\A.*)
            (?<-StateId>)
            (?:
                \w
                |
                <
                # Push State Q2Close, Index = 2
                (?<=(?=(?<StateId>.{2}))\A.*)
                # Push State A, Index = 0
                (?<StateId>)
                |
                \[
                # Push State Q1Close, Index = 6
                (?<=(?=(?<StateId>.{6}))\A.*)
                # Push State QStar, Index = 3
                (?<=(?=(?<StateId>.{3}))\A.*)
                # Push State Q1Comma, Index = 4
                (?<=(?=(?<StateId>.{4}))\A.*)
                # Push State QStar, Index = 3
                (?<=(?=(?<StateId>.{3}))\A.*)
                # Push State Q1Semicolon, Index = 5
                (?<=(?=(?<StateId>.{5}))\A.*)
                # Push State A, Index = 0
                (?<StateId>)
            )
        )
        |
        (?:
            # When In State Q2Close, Index = 2
            (?<=(?=.{2}(?<=\A\k<StateId>))\A.*)
            (?<-StateId>)
            >
        )
        |
        (?:
            # When In State QStar, Index = 3
            (?<=(?=.{3}(?<=\A\k<StateId>))\A.*)
            (?<-StateId>)
            (?:
                # Push State QStar, Index = 3
                (?<=(?=(?<StateId>.{3}))\A.*)
                # Push State Q, Index = 1
                (?<=(?=(?<StateId>.{1}))\A.*)
                |

            )
        )
        |
        (?:
            # When In State Q1Comma, Index = 4
            (?<=(?=.{4}(?<=\A\k<StateId>))\A.*)
            (?<-StateId>)
            ,?
        )
        |
        (?:
            # When In State Q1Semicolon, Index = 5
            (?<=(?=.{5}(?<=\A\k<StateId>))\A.*)
            (?<-StateId>)
            ;
        )
        |
        (?:
            # When In State Q1Close, Index = 6
            (?<=(?=.{6}(?<=\A\k<StateId>))\A.*)
            (?<-StateId>)
            \]
        )
        |\Z
    ){7}
)+
\Z
# Assert state stack is empty
(?(StateId)(?!))

^{Sadly, it is too long to fit in a url, so no online example.}

If we use "binary" stacks with one or zero characters, it would have looked like this: https://gist.github.com/kobi/8012361

Here's a screenshot of the pattern passing all tests: http://i.stack.imgur.com/IW2xr.png

Bonus

The .Net engine can do more than just to balance - it can also capture each instance of A or Q. This requires a slight modification of the pattern: https://gist.github.com/kobi/8156968 .
Note that we've added the groups Start, A and Q to the pattern, but they have no effect of the flow, they are used purely for capturing.

The result: for example, for the string "[<a>b;<c,d>,<e,f>]", we can get these Captures:

Group A
    (0-17) [<a>b;<c,d>,<e,f>]
    (1-4) <a>b
    (2-2) a
    (7-9) c,d
    (13-15) e,f
Group Q
    (0-17) [<a>b;<c,d>,<e,f>]
    (1-3) <a>
    (2-2) a
    (4-4) b
    (6-10) <c,d>
    (7-7) c
    (9-9) d
    (12-16) <e,f>
    (13-13) e
    (15-15) f

Open questions

Can every grammar be converted to the state-stack notation?
Is (state count)×(input length) enough steps to match all words? What other formula can work?

Source code

The code used to generate these patterns and all test cases can be found on https://github.com/kobi/RecreationalRegex

score 10 · Answer 2 · edited Jun 20 '20 at 09:12

The answer is _^(probably) Yes.

The technique is much more complex than the (?1) recursive call, but the result is almost 1-to-1 with the rules of the grammar - I worked in a such methodical way I can easily see it scripted. Basically, you match block-by-block, and use the stack to keep track of where you are. This is an almost working solution:

^(?:
    (\w(?<Q>)) # Q1
    |
    (<(?<Angle>))  #Q2 - start <
    |
    (\>(?<-Angle>)(?<-A>)?(?<Q>))  #Q2 - end >, match Q
    |
    (\[(?<Block>))  # Q3 start - [
    |
    (;(?<Semi-Block>)(?<-A>)?)  #Q3 - ; after [
    |
    (\](?<-Semi>)(?<-Q>)*(?<Q>))  #Q3 ] after ;, match Q
    |
    ((,|(?<-Q>))*(?<A>))   #Match an A group
)*$
# Post Conditions
(?(Angle)(?!))
(?(Block)(?!))
(?(Semi)(?!))

It is missing the part of allowing commas in Q->[A;Q*,?Q*], and for some reason allows [A;A], so it matches [;,,] and [abc;d,e,f]. Rest of the strings match the same as the test cases.
Another minor point is an issue with pushing to the stack with an empty capture - it doesn't. A accepts Ø, so I had to use (?<-A>)? to check if it captured.

The whole regex should look like this, but again, it is useless with the bug there.

Why it isn't working?

There is not way of synchronizing the stacks: if I push (?<A>) and (?<B>), I can pop them in any order. That is why this pattern cannot differentiate <z[a;b>] from <z[a;b]>... we need one stack for all.
This can be solved for simple cases, but here we have something much more complicate - A whole Q or A, not just "<" or "[".

As a side note, I'm surprised I found so little material on the subject. I learned as I go, an it took a great deal of time. Anyway, if anyone finds the flaw in my code I'll be happy to hear. — Kobi, Jul 28 '10 at 20:12
This question still hunts me. I know my answer is fundamentally wrong, since it doesn't take order into account, similarly to http://kobikobi.wordpress.com/2010/12/14/net-regex-matching-mixed-balanced-parentheses/ — Kobi, Feb 18 '11 at 14:43
If you don't mind, would you please answer this question with the solution in your blog? http://stackoverflow.com/questions/26502468/controlling-balanced-parenthesis — nhahtdh, Oct 23 '14 at 03:08
@nhahtdh - I am currently on vacation in Japan! Feel free to add or edit your answer Thanks! — Kobi, Oct 23 '14 at 15:11