I am trying to figure out how to capture one statement if the other one doesn't exist using preg_match.

Sample Text:

<!-- InstanceBeginEditable name="doctitle" -->

<title>BU Libraries | Research Guides | Citing Your Sources</title>

<!-- InstanceEndEditable -->

<div id="standardpgt"><h1><!-- InstanceBeginEditable name="pagetitle" --><strong>Citing Your Sources</strong><!-- InstanceEndEditable --></h1></div>

Because pagetitle exists, I want to pull it instead of the doctitle tag. Of course there are tons of other characters in between them, but I wanted to show you a small sample.

If pagetitle didn't exist I would want to grab the contents of doctitle.

The twist is that I'm not using the PHP code directly. I'm passing in a regex statement through a config file, and a script then takes it and pulls out the 1st group from the match.

This is what I came up with:

((?!.*?<!--\s*?InstanceBeginEditable\s*?name=\x22pagetitle\x22\s*?-->.*?<!--\s*?InstanceEndEditable\s*?-->)<!--\s*?InstanceBeginEditable\s*?name=\x22doctitle\x22\s*?-->\s*?<title>(.*?)<\/title>\s*?<!--\s*?InstanceEndEditable\s*?-->|<!-- InstanceBeginEditable\s*?name=\x22pagetitle\x22\s*?-->(.*?)<!--\s*?InstanceEndEditable\s*?-->)

The issue is that, for some reason, PHP always returns the first group as group 1, even when it is empty because its alternative didn't match.

For example in the sample text above it would return

0 -> <!-- InstanceBeginEditable name="pagetitle" --><strong>Citing Your Sources</strong><!-- InstanceEndEditable -->
1 -> 
2 -> <strong>Citing Your Sources</strong>

I can't for the life of me figure out how to make this work. I also wrote this regex:

(?(?=.*?<!--\s*?InstanceBeginEditable\s*?name=\x22pagetitle\x22\s*?-->.*?<!--\s*?InstanceEndEditable\s*?-->).*?<!-- InstanceBeginEditable\s*?name=\x22pagetitle\x22\s*?-->(.*?)<!--\s*?InstanceEndEditable\s*?-->|.*?<!--\s*?InstanceBeginEditable\s*?name=\x22doctitle\x22\s*?-->\s*?<title>(.*?)<\/title>\s*?<!--\s*?InstanceEndEditable\s*?-->)

But that didn't work either. Thank you very much for the help.

Chris

  • This is, incidentally, why you [shouldn't try parsing HTML with regexes](http://stackoverflow.com/questions/1732348/regex-match-open-tags-except-xhtml-self-contained-tags/1732454#1732454). Besides the insanity, there *are* better ways. Can you modify the system that takes a regex to do something else instead? Because seriously, the insanity... – Charles Mar 16 '11 at 23:06
  • Great excuse to find the comfortable chair, pull out the laptop and begin reading about the [PHP DOMDocument](http://www.php.net/manual/en/book.dom.php). – Brad Christie Mar 16 '11 at 23:33
  • Don't do it in one single regular expression. – mhitza Mar 16 '11 at 23:50
  • What modifier flags are you setting? i.e. `'/regex/xsi'`? – ridgerunner Mar 17 '11 at 05:58
  • I can modify the system, and there are better ways of doing it. This is just a time where you have to work with the tools that you were given, and then plan to change those tools when you have the resources. – Chris Mar 21 '11 at 17:49
  • And the script uses /si settings – Chris Mar 21 '11 at 17:50

2 Answers

6

Just use the branch reset pattern: (?|...) around your whole expression, as in:

((?|(?!.*?<!--\s*?InstanceBeginEditable\s*?name=\x22pagetitle\x22\s*?-->.*?<!--\s*?InstanceEndEditable\s*?-->)<!--\s*?InstanceBeginEditable\s*?name=\x22doctitle\x22\s*?-->\s*?<title>(.*?)<\/title>\s*?<!--\s*?InstanceEndEditable\s*?-->|<!-- InstanceBeginEditable\s*?name=\x22pagetitle\x22\s*?-->(.*?)<!--\s*?InstanceEndEditable\s*?-->))s

From "man perlre":

"(?|pattern)" This is the "branch reset" pattern, which has the special property that the capture buffers are numbered from the same starting point in each alternation branch. It is available starting from perl 5.10.0.

Capture buffers are numbered from left to right, but inside this construct the numbering is restarted for each branch.

The numbering within each branch will be as normal, and any buffers following this construct will be numbered as though the construct contained only one branch, that being the one with the most capture buffers in it.

This construct will be useful when you want to capture one of a number of alternative matches.

Consider the following pattern. The numbers underneath show in which buffer the captured content will be stored.

         # before  ---------------branch-reset----------- after
         / ( a )  (?| x ( y ) z | (p (q) r) | (t) u (v) ) ( z ) /x
         # 1            2         2  3        2     3     4
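
To see the difference on something smaller than the full pattern, here is a minimal sketch comparing plain alternation with a branch reset group. The `foo`/`bar` patterns and the `bar42` subject are made up purely for illustration; they are not from the question.

```php
<?php
// Hypothetical subject: only the second alternative ("bar") can match.
$subject = 'bar42';

// Plain alternation: each capture group keeps its own number, so the
// group from the unmatched first alternative comes back as ''.
preg_match('/foo(\d+)|bar(\d+)/', $subject, $plain);

// Branch reset: numbering restarts inside each alternative, so whichever
// branch matches puts its capture in group 1.
preg_match('/(?|foo(\d+)|bar(\d+))/', $subject, $reset);

var_dump($plain); // elements: 'bar42', '', '42'
var_dump($reset); // elements: 'bar42', '42'
```

A script that always reads group 1 gets an empty string from the first pattern and `42` from the second, which is the same symptom and fix as in the question.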
jsalvata
  • This answer has been added to the [Stack Overflow Regular Expression FAQ](http://stackoverflow.com/a/22944075/2736496), under "Groups". – aliteralmind Apr 10 '14 at 00:26
1

user178551 is absolutely correct in recommending the use of a branch reset construct. There is fundamentally nothing wrong with your original regex (other than the fact that it is more than 300 characters long and is ALL ON ONE LINE! - and that it is unable to put one of two alternatives in a single capture group). A non-trivial (to put it mildly) regex like this needs to be written in free-spacing mode with indentation so you can actually read it. Here is your original regex with some reasonable whitespace added:

$re_OP1 = '%
    (                                             # $1: whole match
      (?!
        .*?<!--\s*?InstanceBeginEditable\s*?name=\x22pagetitle\x22\s*?-->
        .*?<!--\s*?InstanceEndEditable\s*?-->
      )
           <!--\s*?InstanceBeginEditable\s*?name=\x22doctitle\x22\s*?-->\s*?
           <title>(.*?)<\/title>\s*?              # $2: doctitle contents
           <!--\s*?InstanceEndEditable\s*?-->
    |      <!-- InstanceBeginEditable\s*?name=\x22pagetitle\x22\s*?-->
           (.*?)                                  # $3: pagetitle contents
           <!--\s*?InstanceEndEditable\s*?-->
    )
    %six';

Looking at this regex now, you can see where you have hard-coded one space on the line with the OR operator (i.e. |<!-- InstanceBegin...). This causes the regex to fail to match when the 'x' modifier is applied, because free-spacing mode ignores unescaped whitespace in the pattern. So replacing this space with a \s* and running it on your test data, here are the results I get (php-5.2.14):

Array
(
    [0] => <!-- InstanceBeginEditable name="pagetitle" --><strong>Citing Your Sources</strong><!-- InstanceEndEditable -->
    [1] => <!-- InstanceBeginEditable name="pagetitle" --><strong>Citing Your Sources</strong><!-- InstanceEndEditable -->
    [2] =>
    [3] => <strong>Citing Your Sources</strong>
)
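
As a quick check of the point about the 'x' modifier and the hard-coded space, here is a minimal sketch on a shortened, made-up subject string (not the full sample text). The variable names are only for illustration.

```php
<?php
// Shortened, hypothetical subject with one space after "<!--".
$subject = '<!-- InstanceBeginEditable -->';

// With x, the unescaped space in the pattern is ignored, so the pattern
// is effectively "<!--InstanceBeginEditable" and does not match.
$literalSpaceX = preg_match('%<!-- InstanceBeginEditable%x', $subject);

// Writing the whitespace as \s* survives free-spacing mode.
$explicitWsX = preg_match('%<!--\s*InstanceBeginEditable%x', $subject);

// Without x, the literal space is matched as a normal character.
$literalSpace = preg_match('%<!-- InstanceBeginEditable%', $subject);

var_dump($literalSpaceX); // int(0)
var_dump($explicitWsX);   // int(1)
var_dump($literalSpace);  // int(1)
```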

The array above is similar to the results you posted (though for some reason yours show only 2 capture groups). All we need to do now is apply user178551's branch reset suggestion, and the regex solution becomes:

$re_jmr = '%
    (?|  # Branch reset construct. (restart counting for each alternative)
      (?!
        .*?<!--\s*InstanceBeginEditable\s*name="pagetitle"\s*-->
        .*?<!--\s*InstanceEndEditable\s*-->
      )
           <!--\s*InstanceBeginEditable\s*name="doctitle"\s*-->\s*
           <title>(.*?)<\/title>\s*              # $1: Group 1A
           <!--\s*InstanceEndEditable\s*-->
    |      <!--\s*InstanceBeginEditable\s*name="pagetitle"\s*-->
           (.*?)                                  # $1: Group 1B
           <!--\s*InstanceEndEditable\s*-->
    )
    %six';

I've gone ahead and changed all the lazy \s*? to greedy (because greedy is what you want here). I also changed all the \x22 to just " - shorter and more readable IMHO. And here are the results from running with this new, branch reset regex:

Array
(
    [0] => <!-- InstanceBeginEditable name="pagetitle" --><strong>Citing Your Sources</strong><!-- InstanceEndEditable -->
    [1] => <strong>Citing Your Sources</strong>
)

Which is (if I'm not mistaken) exactly what you are looking for. (You did not provide a test case for the other alternative, so that has not yet been tested.) Other than that, your original regex was pretty close.
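
For the untested alternative, something like the following sketch could be used. The `$docOnly` text is an assumption: it is the question's sample with the pagetitle region stripped out by hand, not real data from the site. The `first_group()` helper is likewise hypothetical and just mimics a script that reads group 1.

```php
<?php
// Branch reset regex from above, unchanged.
$re_jmr = '%
    (?|  # Branch reset construct. (restart counting for each alternative)
      (?!
        .*?<!--\s*InstanceBeginEditable\s*name="pagetitle"\s*-->
        .*?<!--\s*InstanceEndEditable\s*-->
      )
           <!--\s*InstanceBeginEditable\s*name="doctitle"\s*-->\s*
           <title>(.*?)<\/title>\s*              # $1: Group 1A
           <!--\s*InstanceEndEditable\s*-->
    |      <!--\s*InstanceBeginEditable\s*name="pagetitle"\s*-->
           (.*?)                                  # $1: Group 1B
           <!--\s*InstanceEndEditable\s*-->
    )
    %six';

// Sample text from the question: both editable regions are present.
$both = <<<'HTML'
<!-- InstanceBeginEditable name="doctitle" -->
<title>BU Libraries | Research Guides | Citing Your Sources</title>
<!-- InstanceEndEditable -->
<div id="standardpgt"><h1><!-- InstanceBeginEditable name="pagetitle" --><strong>Citing Your Sources</strong><!-- InstanceEndEditable --></h1></div>
HTML;

// Assumed variant: the same text with no pagetitle region at all.
$docOnly = <<<'HTML'
<!-- InstanceBeginEditable name="doctitle" -->
<title>BU Libraries | Research Guides | Citing Your Sources</title>
<!-- InstanceEndEditable -->
<div id="standardpgt"><h1>Citing Your Sources</h1></div>
HTML;

// Hypothetical helper: return group 1, or null when nothing matches.
function first_group($pattern, $html) {
    return preg_match($pattern, $html, $m) ? $m[1] : null;
}

$fromBoth    = first_group($re_jmr, $both);    // pagetitle contents
$fromDocOnly = first_group($re_jmr, $docOnly); // falls back to <title>

var_dump($fromBoth, $fromDocOnly);
```

Note that the negative lookahead only looks forward from the doctitle marker. That is enough here, because a pagetitle region that appears earlier in the text would be matched first by the second alternative anyway.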

ridgerunner
  • You are completely right. It was very hard to read. The reason I put the \x22 in instead of the " is that quotes break the way I currently get the regex to the PHP. It is currently a script that reads a config file, and the regex pattern is pulled from the config file, which only accepts single-line settings. We are currently in the process of rewriting the entire process, because it is incredibly inefficient. Hopefully we will use XML parsing instead of regex in the future. But thank you so much for this great explanation, I really did learn a lot. – Chris Mar 21 '11 at 17:47
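
Given the single-line and no-quotes constraints described in this comment, the branch reset regex can be collapsed back onto one line with \x22 in place of the quotes. The sketch below rests on assumptions: the script is assumed to wrap the config value in `/` delimiters with the `si` flags and then read group 1. The `extract_group1()` helper and the shortened test strings are hypothetical.

```php
<?php
// Single-line pattern body as it might sit in the config file.
// \x22 stands in for the double quote; no x flag, so no free spacing.
$configPattern = '(?|(?!.*?<!--\s*InstanceBeginEditable\s*name=\x22pagetitle\x22\s*-->.*?<!--\s*InstanceEndEditable\s*-->)<!--\s*InstanceBeginEditable\s*name=\x22doctitle\x22\s*-->\s*<title>(.*?)<\/title>\s*<!--\s*InstanceEndEditable\s*-->|<!--\s*InstanceBeginEditable\s*name=\x22pagetitle\x22\s*-->(.*?)<!--\s*InstanceEndEditable\s*-->)';

// Hypothetical stand-in for the script: wrap the pattern in delimiters,
// apply the si flags, and return group 1 (null when there is no match).
function extract_group1($patternBody, $html) {
    $regex = '/' . $patternBody . '/si';
    return preg_match($regex, $html, $m) ? $m[1] : null;
}

// Shortened, made-up test strings in the same shape as the sample text.
$pageAndDoc = '<!-- InstanceBeginEditable name="doctitle" --><title>Doc Title</title><!-- InstanceEndEditable -->'
            . '<h1><!-- InstanceBeginEditable name="pagetitle" --><strong>Page Title</strong><!-- InstanceEndEditable --></h1>';
$docOnly    = '<!-- InstanceBeginEditable name="doctitle" --><title>Doc Title</title><!-- InstanceEndEditable --><h1>Page Title</h1>';

$pageResult = extract_group1($configPattern, $pageAndDoc); // pagetitle wins
$docResult  = extract_group1($configPattern, $docOnly);    // doctitle fallback

var_dump($pageResult, $docResult);
```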