5

Honestly, I think I should first ask for your help with the wording of this question.

If you understand what I mean, please edit the title to something more suitable.

Is there a way to write a pattern that can split a text like this?

{{START}}
    {{START}}
        {{START}}
            {{START}}
            {{END}}
        {{END}}
    {{END}}
{{END}}

So that every {{START}} is matched with its own {{END}}, from the innermost pair to the outermost.

And if I cannot do that with regex alone, what about doing it using PHP?

Thank you in advance.

  • 5
    It cannot be done with most flavors of regex, though there are tricks, beyond my ken, that make it possible in languages like Perl. Read about the pumping lemma to find out why you can't do this. – siride Jun 22 '13 at 04:05
  • I suppose you're formatting some kind of input. If you explained a little more, perhaps an alternative approach could be suggested. –  Jun 22 '13 at 04:18
  • It sounds like you're trying to parse something... [If the something is anywhere near as complex as HTML (looks so to me), doing it with regexes is a bad idea.](http://stackoverflow.com/questions/1732348/regex-match-open-tags-except-xhtml-self-contained-tags/1732454#1732454) – michaelb958--GoFundMonica Jun 22 '13 at 04:24
  • Potential duplicates: [Recursive regex not matching template blocks](http://stackoverflow.com/q/15478935/367456); [Regex for checking if a string has mismatched parentheses?](http://stackoverflow.com/q/562606/367456); [How does regular expression engine parse regex with recursive subpatterns?](http://stackoverflow.com/q/11670111/367456); [Why will this recursive regex only match when a character repeats 2^n - 1 times?](http://stackoverflow.com/q/3738631/367456); [Regex to match top level delimiters in a multi dimensional string](http://stackoverflow.com/q/11467705/367456) – hakre Jun 22 '13 at 12:03
  • Please use the search before asking a question. Now that so many duplicates are linked, please share what you've tried so far, provide a reference to it, and let us know why it didn't work for you. – hakre Jun 22 '13 at 12:04

3 Answers

4

This is beyond the capability of a regular expression in the formal sense, which can only recognize regular languages. What you're describing requires a pushdown automaton (regular languages are those recognized by a finite automaton).

You can use regular expressions to parse the individual elements, but the "depth" part needs to be handled by a language with a concept of memory (PHP is fine for this).

So in your solution, regexes will just be used for identifying your tags, while the real logic of tracking depth and determining which element each END tag belongs to must live in your program itself.
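A minimal sketch of that split of responsibilities, assuming the unnumbered {{START}}/{{END}} tags from the question and PHP 7.1 or later. The regex only locates the tags; an array used as a stack pairs each {{END}} with the most recent unmatched {{START}}. The function name findBlocks and the shape of the returned array are hypothetical, chosen only for illustration.

```php
<?php
// Hypothetical helper: returns every {{START}}...{{END}} block, innermost first.
function findBlocks(string $text): array
{
    // The regex only identifies tags; PREG_OFFSET_CAPTURE adds each tag's byte offset.
    preg_match_all(
        '/\{\{(START|END)\}\}/',
        $text,
        $tags,
        PREG_SET_ORDER | PREG_OFFSET_CAPTURE
    );

    $stack = [];   // offsets of {{START}} tags that are still open
    $blocks = [];  // completed blocks, in the order their {{END}} was seen

    foreach ($tags as $tag) {
        [$full, $offset] = $tag[0];  // whole tag text and its offset
        $kind = $tag[1][0];          // "START" or "END"

        if ($kind === 'START') {
            $stack[] = $offset;
            continue;
        }
        if ($stack === []) {
            throw new RuntimeException('Unmatched END tag at offset ' . $offset);
        }
        $start = array_pop($stack);
        $blocks[] = [
            'depth' => count($stack),  // 0 for an outermost block
            'text'  => substr($text, $start, $offset + strlen($full) - $start),
        ];
    }
    if ($stack !== []) {
        throw new RuntimeException('Unmatched START tag at offset ' . end($stack));
    }
    return $blocks;
}

$input = "{{START}}a{{START}}b{{END}}{{START}}c{{END}}{{END}}";
foreach (findBlocks($input) as $block) {
    echo $block['depth'], ': ', $block['text'], "\n";
}
```

Because the stack tracks depth explicitly, this also handles sibling blocks at the same level and reports unbalanced tags instead of silently skipping them.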

Community
tylerl
  • 1
    PHP uses a regex engine that can do more than just regular expressions (http://pcre.org/pcre.txt), so your answer is only of academic interest, not of practical relevance. However, you can use that engine to do what you outline as well. It is just the first part that does not apply to PHP/PCRE. – hakre Jun 22 '13 at 11:55
3

It is possible! You can have each level of content using a recursive regular expression:

$data = <<<LOD
{{START1}}
    aaaaa
    {{START2}}
        bbbbb
        {{START3}}
            ccccc
            {{START4}}
                ddddd
            {{END4}}
        {{END3}}
    {{END2}}
{{END1}}
LOD;

$pattern = '~(?=({{START\d+}}(?>[^{]++|(?1))*{{END\d+}}))~';
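preg_match_all($pattern, $data, $matches);

// $matches[0] contains only empty strings (a lookahead consumes nothing);
// the captured blocks, one per nesting level, are in $matches[1].
print_r($matches);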

Explanation:

Part: ({{START\d+}}(?>[^{]++|(?1))*{{END\d+}})

This part of the pattern describes a nested structure delimited by {{START#}} and {{END#}}:

(             # open the first capturing group
{{START\d+}}  # an opening tag
(?>           # open an atomic group (= backtracking forbidden)
    [^{]++    # everything that is not a {, one or more times (possessive)
  |           # OR
    (?1)      # recurse into the first capturing group itself
)*            # close the atomic group, repeated zero or more times
{{END\d+}}    # a closing tag
)             # close the first capturing group

Now the problem is that you can't capture all the levels with this part alone, because the outermost match consumes all the characters of the string. In other words, a match can't overlap a previous match.

The solution is to wrap the whole pattern inside a zero-width assertion, which doesn't consume characters, such as a lookahead (?=...). Result:

(?=({{START\d+}}(?>[^{]++|(?1))*{{END\d+}}))

This will match all the levels.
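As a hedged illustration, here is the same lookahead-plus-recursion idea adapted to the unnumbered {{START}}/{{END}} tags from the question. The adaptation (dropping \d+ and escaping the braces) and the sample string are assumptions made for this sketch, not part of the original pattern.

```php
<?php
// Sample input (an assumption for this sketch): three strictly nested blocks.
$data = "{{START}}a{{START}}b{{START}}c{{END}}{{END}}{{END}}";

// Same structure as the pattern above, with literal braces escaped
// and the numeric suffix removed.
$pattern = '~(?=(\{\{START\}\}(?>[^{]++|(?1))*\{\{END\}\}))~';
preg_match_all($pattern, $data, $matches);

// The lookahead consumes nothing, so $matches[0] holds only empty strings;
// the blocks are in $matches[1], in order of their starting position.
$blocks = $matches[1];
foreach ($blocks as $block) {
    echo $block, "\n";
}
```

One limitation to keep in mind: because the content is matched with [^{]++, a block whose text contains a literal { that is not part of a tag will not match.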

Casimir et Hippolyte
1

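You cannot do this with a single pure regex match; however, it can be accomplished with a simple loop that matches the outermost block and then searches again inside it. Note that this approach assumes the blocks are strictly nested one inside another, as in the question; sibling blocks at the same level would be merged by the greedy match.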

JS Example:

// [\s\S]* matches any character, including line breaks (older JS engines have no dotall flag)
var exp = /\{\{START\}\}([\s\S]*)\{\{END\}\}/;

var myString = "{{START}}\ntest\n{{START}}\ntest 2\n{{START}}\ntest 3\n{{START}}\ntest4\n{{END}}\n{{END}}\n{{END}}\n{{END}}";

var matches = [];
var m = exp.exec(myString);
while ( m != null ) {
    matches.push(m[0]);
    m = exp.exec(m[1]);
}

alert(matches.join("\n\n"));

PHP (the same loop):

$pattern = '/\{\{START\}\}([\s\S]*)\{\{END\}\}/';
$myString = "{{START}}\ntest\n{{START}}\ntest 2\n{{START}}\ntest 3\n{{START}}\ntest4\n{{END}}\n{{END}}\n{{END}}\n{{END}}";

$outMatches = array();
// Match the outermost block, then repeat the search inside its captured content
$result = preg_match($pattern, $myString, $matches);
while ($result) {
    $outMatches[] = $matches[0];
    $result = preg_match($pattern, $matches[1], $matches);
}
echo implode("\n\n", $outMatches);

Output:

{{START}}
test
{{START}}
test 2
{{START}}
test 3
{{START}}
test4
{{END}}
{{END}}
{{END}}
{{END}}

{{START}}
test 2
{{START}}
test 3
{{START}}
test4
{{END}}
{{END}}
{{END}}

{{START}}
test 3
{{START}}
test4
{{END}}
{{END}}

{{START}}
test4
{{END}} 
Matt MacLean