I've spent days trying to find a solution WITH regex (before somebody says it: I know I should be using PHP's DOMDocument library or something similar, but let's treat this as a theoretical question), looking up answers, and I finally came up with what I'll show near the end of this question.

What follows is just a summary of a lot of things I've tried before.

First of all, what I mean by nested tags of the same type is:

Text outside any div
<div id="my_id"> bla bla
  <div>
  bla bla bla
    <div style="some style here">
      lalalalala
     </div>
   </div>
    I'm trapped in a div!
</div>
more text outside divs

<div>more divs here!
       <div id="justbeingannoying">radiohead rules</div>
</div>

Now imagine I want to remove all the divs and their content using regex. So the intended result would be:

Text outside any div
more text outside divs

The first idea would be matching everything. The following regex matches div tags with properties (style, id, etc):

/<div[^>]*>.*<\/div>/sig

The problem, of course, is that this will match everything between the beginning of the first "<div" and the last "</div>", so it will match "more text outside divs" too (check here: https://regex101.com/r/iR8mY2/1 ), which is not what we (I) want.

This could be solved using the U modifier (Ungreedy)

/<div[^>]*>.*<\/div>/sigU

but then we'll have the opposite problem of matching less than we want: it will match only from the first "<div" until the first "</div>" (so, if we remove the matches, besides some unmatched tags, we'll be left with the text "I'm trapped in a div!", which we don't want).
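The two failure modes can be reproduced with a small sketch. It uses Python's `re` module rather than PCRE, so the ungreedy `U` modifier is written as the lazy quantifier `.*?` (an assumption made for illustration; the sample is a shortened version of the HTML above):

```python
import re

# Shortened version of the sample from the question.
html = (
    'Text outside any div\n'
    '<div id="my_id"> bla bla\n'
    '  <div>\n'
    '  bla bla bla\n'
    '  </div>\n'
    "  I'm trapped in a div!\n"
    '</div>\n'
    'more text outside divs\n'
    '<div>more divs here!</div>\n'
)

# Greedy: runs from the first opening tag to the LAST closing tag.
greedy = re.compile(r'<div[^>]*>.*</div>', re.S | re.I)
# Lazy: stops at the FIRST closing tag after each opening tag.
lazy = re.compile(r'<div[^>]*>.*?</div>', re.S | re.I)

greedy_result = greedy.sub('', html)
lazy_result = lazy.sub('', html)

print(repr(greedy_result))  # "more text outside divs" has been removed too
print(repr(lazy_result))    # "I'm trapped in a div!" and a stray </div> remain
```

Neither quantifier keeps track of nesting depth, which is why both results are wrong in opposite directions.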

So, I found a solution that works like a charm for nested parentheses, square brackets, etc:

/\[([^\[\]]*+|(?R))*\]/si

Basically, this finds an opening square bracket, then matches anything that is neither an opening nor a closing square bracket OR a recursive structure of that, and finally finds a closing square bracket.

What I have working now is a bad solution: first I replace all the opening tags with an opening square bracket (which can't appear in my code, for other reasons), then the closing tags with a closing square bracket, and then I use the previous regex. Not a very elegant solution, I know.

The thing is, I really want to know how this could be done with just one regex. It seems obvious that replacing the "[" and the "]" in the previous regex with the HTML tags has to work. But it is not that easy. The problem is that character negation ("[^.......]") doesn't work for strings like "div". It seems that something similar can be achieved with this:

.+?(?=<div>)

and, of course, the same for the closing tag

.+?(?=<\/div>)

This is how, more or less, I arrived at this regex:

/<div((.+?(?=<\/div>)|.+?(?=<div>))|(?R))*<\/div>/gis

Which works exactly as the first regex I presented before: https://regex101.com/r/yU8pV3/1

So, here is my question: what is wrong with that regex?

Thank you!

miltonlaufer
  • The reason you're having so much trouble and the reason you're initially aware of the fact that you should use a library that knows how to properly interpret XML/HTML is that regex just isn't suitable for working with HTML. It's just not. Theoretical or not you just can't. – Marty Jun 14 '16 at 22:37
  • For a theoretical question, too many "we want to use" here. Anyway, [see here](https://regex101.com/r/nF6vM6/1) (and [here](https://regex101.com/r/nF6vM6/3)) and know it is not the right way to match nested `div` tags. – Wiktor Stribiżew Jun 14 '16 at 22:51
  • @WiktorStribiżew, thank you very much! That did the trick. Would it be too much to ask you to elaborate on why that worked? – miltonlaufer Jun 15 '16 at 01:30
  • I can only elaborate on why it won't work in all cases. Imagine a `<div>` inside CDATA or a comment, and this regex is useless. **Regex should be used on plain text.** When you use a regex on a marked-up text, you do not have a guarantee that you match a node or plain text. – Wiktor Stribiżew Jun 15 '16 at 05:38
  • @WiktorStribiżew. Thank you, I just wanted to know how to make that regex work, because I want to understand regex better and, having done a ton of research these last few days, it seems I'm not alone. After all, theoretical or not, «too many "we want to use"» or not (what was that? Is there any contradiction between wanting and theory? I have a Philosophy PhD and I've never read that this was a problem), after all, as I was saying, the point here is to share our programming knowledge, isn't it? – miltonlaufer Jun 15 '16 at 07:18
  • I see that your question is met with positive reaction, I will add an answer, with a disclaimer at the start. – Wiktor Stribiżew Jun 15 '16 at 07:18
  • @Marty: maybe I'm wrong, but I read the performance of a regex in cases in which you _can_ use a regex (I'm not saying this is the case, that would be to beg the question) is way better than the DOM management classes. What do you think? – miltonlaufer Jun 15 '16 at 07:20

1 Answer

DISCLAIMER

Since the question is met with positive reaction, I will post an answer explaining what is wrong with your approach, and will show how to match text that is not some specific text.

HOWEVER, I want to emphasize: Do not use this to parse real, arbitrary HTML code, as regex should only be used on plain text.

What is wrong with your regex

Your regex contains the <div((.+?(?=<\/div>)|.+?(?=<div>))|(?R))* part (same as <div((.+?(?=<\/?div>))|(?R))*) before matching the closing <\/div> part. When you have some delimited text, do not rely on plain lazy/greedy dot matching (unless it is used in an unroll-the-loop structure, when you know what you are doing). What it does is this:

  • <div - match <div literally (it also matches in <diverse, since there is no word boundary or \s after it)
  • ( - Group 1 that matches:
    • (.+?(?=<\/div>)|.+?(?=<div>)) - matches either any 1+ chars (as few as possible) up to the first </div> or to the first <div>
    • |
    • (?R) - Recurse the whole pattern (i.e. insert and use the entire regex at this point)
  • )* - repeat Group 1 zero or more times.

The problem is clear: the (.+?(?=<\/?div>)) part does not exclude matching <div> or </div>. This branch MUST only match text that is NOT EQUAL to the leading and trailing delimiters.
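The diagnosis above can be checked in isolation. The sketch below uses Python's `re` (no recursion is needed for this part) and a made-up `sample` string; it applies the lazy-dot branch and a per-character negative lookahead at the position of a nested opening tag:

```python
import re

sample = '<div>outer<div>inner</div></div>'

# The branch from the question in its merged form: lazy dot plus a lookahead.
lazy_branch = re.compile(r'.+?(?=</?div>)', re.S)
# A per-character check: the lookahead runs before every consumed character.
tempered = re.compile(r'(?:(?!</?div\b).)+', re.S)

inner_open = sample.index('<div>', 1)  # position of the nested opening tag

lazy_match = lazy_branch.match(sample, inner_open)
tempered_match = tempered.match(sample, inner_open)

print(lazy_match.group())  # prints <div>inner (the opening tag was swallowed)
print(tempered_match)      # prints None (it refuses to start on a delimiter)
```

Because `.+?` must consume at least one character before its lookahead is tested, it can step over the `<` of a nested tag, so the recursive alternative never gets a chance to handle that tag.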

Solution(s)

To match text other than some specific text use a tempered greedy token.

<div\b[^<]*>((?:(?!<\/?div\b).)+|(?R))*<\/div>\s*
             ^^^^^^^^^^^^^^^^^^^ 

See the regex demo. Note you must use a DOTALL modifier so as to be able to match text across newlines. The capturing group is redundant; you can make it non-capturing.

What is important here is that (?:(?!<\/?div\b).)+ matches 1 or more characters, none of which is the starting character of a <div....> or </div sequence. See my above linked thread on how that works.
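For readers without a PCRE engine at hand: Python's built-in `re` module does not support `(?R)`, so the sketch below swaps recursion for a different technique, repeatedly deleting innermost divs until none are left. It reuses the tempered greedy token (with `*` so that empty divs match too); `strip_divs` and `INNERMOST_DIV` are hypothetical names, and the same caveat about real HTML applies.

```python
import re

# An innermost div: opening tag, then text with no <div or </div inside,
# then the closing tag.
INNERMOST_DIV = re.compile(r'<div\b[^<>]*>(?:(?!</?div\b).)*</div>', re.S | re.I)

def strip_divs(text):
    """Remove balanced div elements by peeling them off from the inside out."""
    while True:
        text, count = INNERMOST_DIV.subn('', text)
        if count == 0:
            return text

html = '''Text outside any div
<div id="my_id"> bla bla
  <div>
  bla bla bla
    <div style="some style here">
      lalalalala
     </div>
   </div>
    I'm trapped in a div!
</div>
more text outside divs

<div>more divs here!
       <div id="justbeingannoying">radiohead rules</div>
</div>
'''

result = strip_divs(html)
print(result)
```

Each pass removes one nesting level, so the loop runs once per level of depth plus a final pass that finds nothing. Blank lines left behind by the removed elements are kept as-is.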

As for performance, tempered greedy tokens are resource-consuming. The unroll-the-loop technique comes to the rescue:

<div\b[^<]*>(?:[^<]+(?:<(?!\/?div\b)[^<]*)*|(?R))*<\/div>\s*

See this regex demo

Now, the token looks like [^<]+(?:<(?!\/?div\b)[^<]*)*: 1+ characters other than <, followed by 0+ sequences of a < that is not followed by /div or div (as a whole word) and then again 0+ non-<s.
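The two tokens can be compared side by side without recursion. This is a minimal sketch in Python's `re` with made-up input strings. As far as I can tell, it also shows one behavioural difference worth knowing: the unrolled form starts with `[^<]+`, so it cannot begin directly on a non-div tag, whereas the tempered token can.

```python
import re

tempered = re.compile(r'(?:(?!</?div\b).)+', re.S)
unrolled = re.compile(r'[^<]+(?:<(?!/?div\b)[^<]*)*')

# Text that starts with a non-< character: both tokens stop before <div.
text = 'a <b>bold</b> word<div>x'
tempered_text = tempered.match(text).group()
unrolled_text = unrolled.match(text).group()
print(tempered_text == unrolled_text)  # prints True

# Text that starts with a non-div tag: only the tempered token matches.
leading_tag = '<b>x</b><div>'
print(tempered.match(leading_tag).group())  # prints <b>x</b>
print(unrolled.match(leading_tag))          # prints None
```

If div content may start with another tag immediately after the opening tag, the leading `[^<]+` would presumably need relaxing (for example to `[^<]*`), at the cost of a branch that can match the empty string.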

<div\b might still match in <div-tmp, so perhaps, <div(?:\s|>) is a better way to deal with this via regex. Still, parsing HTML with DOM is much easier.

Wiktor Stribiżew