I've spent days trying to find a solution WITH regex, looking answers up, and I finally came up with what I'll show near the end of this question. (Before somebody says it: I know I should be using PHP's DOMDocument library or something similar, but let's take this as a theoretical question.)
What follows is just a summary of a lot of things I've tried before.
First of all, what I mean by nested tags of the same type is:
Text outside any div
<div id="my_id"> bla bla
<div>
bla bla bla
<div style="some style here">
lalalalala
</div>
</div>
I'm trapped in a div!
</div>
more text outside divs
<div>more divs here!
<div id="justbeingannoying">radiohead rules</div>
</div>
Now imagine I want to remove all the divs and their content using regex. So the intended result would be:
Text outside any div
more text outside divs
The first idea would be to match everything. The following regex matches div tags with attributes (style, id, etc.):
/<div[^>]*>.*<\/div>/sig
The problem, of course, is that this will match everything between the beginning of the first "<div" and the last "</div>", so it will match "more text outside divs" too (check here: https://regex101.com/r/iR8mY2/1 ), which is not what we (I) want.
This could be solved using the U modifier (Ungreedy)
/<div[^>]*>.*<\/div>/sigU
but then we'll have the opposite problem of matching less than we want: it will match only from the first "<div" to the first "</div>" (so, if we remove the matches, besides some unmatched closing tags, we will still have the text "I'm trapped in a div!", which we don't want).
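To make the difference concrete, here is a minimal PHP sketch that applies both patterns to the sample above with preg_replace. The variable names ($html, $greedy, $lazy) are just for illustration, and the "g" flag from regex101 is dropped because preg_replace already replaces every match.

```php
<?php
// Sample input from the question.
$html = <<<HTML
Text outside any div
<div id="my_id"> bla bla
<div>
bla bla bla
<div style="some style here">
lalalalala
</div>
</div>
I'm trapped in a div!
</div>
more text outside divs
<div>more divs here!
<div id="justbeingannoying">radiohead rules</div>
</div>
HTML;

// Greedy: one match from the first "<div" to the very last "</div>".
$greedy = preg_replace('/<div[^>]*>.*<\/div>/si', '', $html);

// Ungreedy (U): each match stops at the first "</div>" it can reach.
$lazy = preg_replace('/<div[^>]*>.*<\/div>/siU', '', $html);

// The greedy version also swallows the text between the two outer divs.
var_dump(strpos($greedy, 'more text outside divs') === false); // bool(true)

// The ungreedy version leaves inner text and stray closing tags behind.
var_dump(strpos($lazy, "I'm trapped in a div!") !== false);    // bool(true)
var_dump(strpos($lazy, '</div>') !== false);                   // bool(true)
```

Neither pattern keeps track of nesting depth, which is why one removes too much and the other too little.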
So, I found a solution that works like a charm for nested parentheses, square brackets, etc.:
/\[([^\[\]]*+|(?R))*\]/si
Basically, this finds an opening square bracket, then matches either anything that is neither an opening nor a closing square bracket, or a recursive occurrence of the whole pattern, repeated as many times as needed, and finally a closing square bracket.
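A small sketch of that bracket pattern in PHP, on a made-up test string ($text is only an example, not my real data):

```php
<?php
// Recursive pattern: "[", then any mix of non-bracket runs or nested
// bracket groups (via (?R), which re-enters the whole pattern), then "]".
$pattern = '/\[([^\[\]]*+|(?R))*\]/si';

$text = 'keep [drop [nested [deeper]] still drop] keep too [x]';

// Each balanced group is removed as a single match, however deep it nests.
$result = preg_replace($pattern, '', $text);

var_dump($result); // string(15) "keep  keep too "
```

The possessive quantifier (*+) on the character class stops the engine from re-splitting a run of non-bracket characters when it backtracks.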
What I have working now is a bad solution: basically, first I replace all the opening tags with an opening square bracket (a character which can't be in my code, for other reasons), then all the closing tags with a closing square bracket, and then I use the previous regex. Not a very elegant solution, I know.
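Roughly, the workaround looks like the sketch below. The function name strip_divs_via_brackets and the two tag patterns are illustrative rather than my exact code, and it only works under the assumption that the input never contains "[" or "]".

```php
<?php
// Three-pass workaround; assumes the input contains no square brackets.
function strip_divs_via_brackets(string $html): string
{
    // Step 1: every opening <div ...> tag becomes "[".
    $s = preg_replace('/<div\b[^>]*>/i', '[', $html);
    // Step 2: every closing </div> tag becomes "]".
    $s = preg_replace('/<\/div\s*>/i', ']', $s);
    // Step 3: remove balanced bracket groups with the recursive pattern.
    return preg_replace('/\[([^\[\]]*+|(?R))*\]/si', '', $s);
}

$html = "Text outside any div\n"
      . "<div id=\"a\">x<div>y</div>I'm trapped in a div!</div>\n"
      . "more text outside divs\n"
      . "<div>z</div>";

$out = strip_divs_via_brackets($html);

var_dump($out === "Text outside any div\n\nmore text outside divs\n"); // bool(true)
```

It gives the result I want, but it needs three passes and a pair of characters that never occurs in the input, which is why I'm after a single regex.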
The thing is, I really want to know how this could be done with just one regex. It seems obvious that replacing the "[" and the "]" in the previous regex with the HTML tags should work. But it's not that easy. The problem is that character negation ("[^.......]") doesn't work for strings like "div". It seems that something similar can be achieved by this:
.+?(?=<div>)
and, of course, the same for the closing tag
.+?(?=<\/div>)
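To see what these two lookahead fragments match on their own, here is a quick PHP check on a tiny nested example (the $html string is made up for the test):

```php
<?php
$html = '<div>a<div>b</div>c</div>';

// Lazy match up to (not including) the nearest closing tag.
preg_match('/.+?(?=<\/div>)/s', $html, $m);
var_dump($m[0]); // string(12) "<div>a<div>b"

// Lazy match up to (not including) the nearest opening tag. Because .+?
// must consume at least one character, the tag at offset 0 is skipped.
preg_match('/.+?(?=<div>)/s', $html, $m2);
var_dump($m2[0]); // string(6) "<div>a"
```

Note that each fragment only stops at the tag named in its own lookahead. The first one runs straight through the inner "<div>", so unlike "[^\[\]]" it is not a true "neither opening nor closing tag" match.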
This is how, more or less, I arrived at this regex:
/<div((.+?(?=<\/div>)|.+?(?=<div>))|(?R))*<\/div>/gis
which behaves exactly like the first regex I presented above: https://regex101.com/r/yU8pV3/1
So, here is my question: what is wrong with that regex?
Thank you!