
Regex fans, hi again. Last time, Casimir and Hippolyte found an elegant solution to my problem.

Regex: matching open/close tags which accepts another open/close tag with same name

The program has changed a little since then, and starting from his (her?) regex, I managed to find a working solution. However, I'm not completely satisfied with it.

The thing is that now there are two types of components:

  • those whose opening tag ends with a plus (+)
  • those whose opening tag ends with a minus (-)

However, they both have the same ending tag. Also, both types of components can contain the other type (a plus can contain the minus type and vice versa).

I need to get the content of "plus components" only.

<?php



$subject = '

{{poo+}}            # T1
    Hello

    {{poo-}}         # T2
        Nested 1
    {{/poo}}        # T3

{{/poo}}            # T4


{{poo+}}             # T5
    Bye
{{/poo}}            # T6



';



// The solution below works, but it forces me to capture every type of component.
// I can differentiate them later using PHP, but I'm looking for a regex that does it directly.
//
// The reason is that in the real program there are three component types, and the syntax is
// slightly more complex (so trying all three types would be slower than trying just one),
// and there could be more component instances.

$p = '`(?x)
{{(\w+)([+-])}}
# ( # you need probably this capture group later
(?>
    [^{]++
  |
    { (?!{)
  |
    {{ (?! /? \1 \b) # if needed you can add }} in the lookahead
  |
    (?R)
)*
# )
{{/\1}}
`';


preg_replace_callback($p, function($match){
    var_dump($match);
}, $subject);
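
For reference, the "differentiate them later using PHP" fallback mentioned in the comments above could look roughly like the sketch below. It is only an illustration: the shortened $subject, the extra top-level minus component, and the names $all and $plusOnly are assumptions made for this example, not part of the real program.

```php
<?php
// Shortened sample input (an assumption for this sketch): two plus components,
// one of them containing a minus component, plus one top-level minus component.
$subject = '
{{poo+}}
    Hello
    {{poo-}}
        Nested 1
    {{/poo}}
{{/poo}}
{{poo-}}
    Skip
{{/poo}}
{{poo+}}
    Bye
{{/poo}}
';

// Same pattern as above: group 1 is the tag name, group 2 is the sign.
$p = '`(?x)
{{(\w+)([+-])}}
(?>
    [^{]++
  |
    { (?!{)
  |
    {{ (?! /? \1 \b)
  |
    (?R)
)*
{{/\1}}
`';

// Collect every top-level component, whatever its type.
preg_match_all($p, $subject, $all, PREG_SET_ORDER);

// Keep only the components whose opening tag ends with a plus.
$plusOnly = array_values(array_filter($all, function ($m) {
    return $m[2] === '+';
}));

foreach ($plusOnly as $m) {
    echo $m[0], "\n---\n";
}
```

This is the extra pass I would like to avoid: the regex still has to match the minus component before PHP throws it away.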
ling

2 Answers


Bonjour ling,

This is probably one of the most interesting regex questions of the week.

I haven't studied what the other guys did in detail, so starting from scratch this is what I would suggest.

(?x)
(?>{{(?:[\w-]+(\+)?)}}
  (?:
    [^{}]++
    |
    ((?>{{[\w+-]+}}(?:[^{}]++|(?>(?2)))+{{/[\w+-]+}}))
  )++
{{/[\w+-]+}}
)
(?(1)|(*SKIP)(?!))

How does it work?

The key is quite simple: we match the outside delimiter with {{(?:[\w-]+(\+)?)}}, optionally capturing the + in poo+ to Group 1 if it is there. This allows us, at the very end, in the (?(1)|(*SKIP)(?!)), to check if we had the correct delimiter at the start (conditional check for Group 1). If yes, at that stage the match succeeds. If no, we skip the whole match, preventing the engine from attempting a match on a nested set.

Other details: between the delimiters, we match this expression one or more times (note the ++ quantifier):

[^{}]++
|
((?>{{[\w+-]+}}(?:[^{}]++|(?>(?2)))+{{/[\w+-]+}}))
  1. The top line [^{}]++ allows us to match any content that is not a left or right brace.
  2. As you know, the middle line is an OR
  3. The whole bottom line is captured to Group 2, and it references itself with the (?2) subroutine call. This line is a recursive blueprint to match sets of braces nested inside the outer expression.
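
To try the pattern out, something like the following sketch should do. The backtick delimiters, the shortened $subject (with an extra top-level minus component to exercise the skip), and the variable names are assumptions chosen for this example; the pattern itself is the one given above.

```php
<?php
// Shortened sample input (an assumption for this sketch), with one extra
// top-level minus component that the pattern is expected to skip.
$subject = '
{{poo+}}
    Hello
    {{poo-}}
        Nested 1
    {{/poo}}
{{/poo}}
{{poo-}}
    Skip
{{/poo}}
{{poo+}}
    Bye
{{/poo}}
';

// The pattern from this answer, wrapped in backtick delimiters.
// Group 1 is set only when the outer opening tag ends with a plus.
$p = '`(?x)
(?>{{(?:[\w-]+(\+)?)}}
  (?:
    [^{}]++
    |
    ((?>{{[\w+-]+}}(?:[^{}]++|(?>(?2)))+{{/[\w+-]+}}))
  )++
{{/[\w+-]+}}
)
(?(1)|(*SKIP)(?!))
`';

preg_match_all($p, $subject, $matches);

// $matches[0] holds the full text of each plus component.
foreach ($matches[0] as $component) {
    echo $component, "\n---\n";
}
```

On this input, the top-level minus component is consumed and then discarded by (*SKIP)(?!), so only the two plus components come back.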

An Irrelevant Detail

When you said his (her?) you meant h(?:is|er), right? :)

zx81
  • If I could, I would have accepted your answer as well. Over 10000 iterations, it turns out that your solution is slightly faster than Casimir and Hippolyte's (0.18 vs 0.185). However, I ended up choosing CaH's solution because in h(?:is|er) solution the component's definition needs to be defined only once. In your solution, we have to define it at the entry of the pattern, and also at the recursion level. On the other hand, your solution seems more flexible, so I really don't know yet which one will eventually be implemented. Thank you for bringing an alternative approach (the more the better). – ling May 25 '14 at 06:30
  • Oh... mmm... If I had known you cared about that, there are obvious tweaks I would have made so that code is not repeated. That's actually what I would do for my own regex, but for most people that's too hard to read. Too bad, but of course Casimir always comes up with brilliant solutions so I can't complain there. – zx81 May 25 '14 at 08:48
  • @ling At first glance, I can get it down to 115 printable chars, vs 103 for C&H (removing his comments), so he clearly wins on that count. Still, if size of regex code is going to be a criterion, I suggest announcing that upfront—I would probably have made my answer shorter also. – zx81 May 25 '14 at 09:00
  • It's not really about the size, but more about the DRY approach. If I had known this was going to be a criterion, I would have said so in the question too. Your answer is a valuable asset for me and, I guess, for any future reader. Too bad I could accept only one answer. – ling May 25 '14 at 12:16

To make sure that the top-level match is a poo+ tag without breaking the recursive call, replace ([+-]) with a conditional that tests whether the engine is currently inside a recursion. Example:

$p = '`(?x)
{{(\w+) (?(R)[+-]|\+) }}
# ( # you need probably this capture group later
(?>
    [^{]++
  |
    { (?!{)
  |
    {{ (?! /? \1 \b) # if needed you can add }} in the lookahead
  |
    (?R)
)*
# )
{{/\1}}
`';

It is a simple IF..THEN..ELSE:

(?(R)     # IF    we are inside a recursive call of the whole pattern
    [+-]  # THEN  match either kind of tag
  | \+    # ELSE  match only + tags
)
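
A quick way to check the conditional is the sketch below. It assumes a shortened version of the question's $subject, enables the capture group that is commented out above (it becomes group 2), and uses preg_match_all with variable names picked for this example.

```php
<?php
// Shortened version of the question's sample input (an assumption for this sketch).
$subject = '
{{poo+}}
    Hello
    {{poo-}}
        Nested 1
    {{/poo}}
{{/poo}}
{{poo+}}
    Bye
{{/poo}}
';

// Same pattern as above, with the optional capture group enabled so that
// group 2 holds the content between the opening and closing tags.
$p = '`(?x)
{{(\w+) (?(R)[+-]|\+) }}
(                            # group 2: content of the component
  (?>
      [^{]++
    |
      { (?!{)
    |
      {{ (?! /? \1 \b)
    |
      (?R)
  )*
)
{{/\1}}
`';

preg_match_all($p, $subject, $m);

// $m[2] holds the content of each top-level plus component.
foreach ($m[2] as $content) {
    echo trim($content), "\n---\n";
}
```

The nested {{poo-}} block is only reachable through (?R), where the conditional accepts either sign, so it ends up inside the content of the first plus component instead of being reported on its own.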
Casimir et Hippolyte