
I'm trying to match '<TAG2>' only if it's not inside of <TAG>.

For example:

This is a WORD --- Match
<TAG><TAG2>xxx</TAG2></TAG> --- Not a match
<TAG>xxxxxxx<TAG2>yyyy</TAG2>xxxxxxx</TAG>  --- Not a match

I'm using PHP, so I can't use a variable-length negative look-behind.

I tried using the regex in "Match text not inside span tags", but it doesn't work in my case when there are multiple tags.

<TAG><TAG2>xxx</TAG2></TAG>
<TAG><TAG2>xxx</TAG2></TAG>

This will match from the first <TAG2> to the end of the second </TAG2>. I'm assuming this is because my regex includes <TAG2>[\s\S]*</TAG2>.
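
The greedy-quantifier assumption can be checked in isolation with a minimal PHP sketch. Only the `<TAG2>[\s\S]*</TAG2>` fragment is taken from the text above; the rest of the original regex is not shown, so the variable names and the lazy variant here are illustrative assumptions, not the original pattern.

```php
<?php
// Two identical lines, as in the example above.
$text = "<TAG><TAG2>xxx</TAG2></TAG>\n<TAG><TAG2>xxx</TAG2></TAG>";

// Greedy: [\s\S]* also crosses newlines, so it runs to the LAST </TAG2>.
preg_match('~<TAG2>[\s\S]*</TAG2>~', $text, $greedy);

// Lazy (assumed variant): [\s\S]*? stops at the FIRST </TAG2> it can reach.
preg_match('~<TAG2>[\s\S]*?</TAG2>~', $text, $lazy);

echo $greedy[0], "\n---\n", $lazy[0], "\n";
```

The greedy form spans both lines, while the lazy form stays within the first `<TAG2>...</TAG2>` pair. Laziness alone does not exclude tags nested inside `<TAG>`; it only stops the over-long match.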
Community

1 Answer


Foreword

I recommend using a parsing engine for this. However, it sounds like you have creative control over the complexity of your HTML, so as long as you do not have complex nesting or other odd edge cases, this should work.

Description

(<tag2>.*?</tag2>)|<tag>(?:(?!<tag\s?>).)*


This regular expression will do the following:

  • populate capture group 1 with <tag2>...</tag2>, provided this tag is not already enclosed inside <tag>...</tag>, as in <tag>.<tag2>..</tag2>.</tag>
  • also match each <tag> together with the text that follows it (including any nested <tag2>), but for those matches capture group 1 will have no value.

Example

Live Demo

https://regex101.com/r/uQ7xR5/1

Sample text

This <tag2>is a WORD</tag2> --- Match
<TAG><TAG2>xxx</TAG2></TAG> --- Not a match
<TAG>xxxxxxx<TAG2>yyyy</TAG2>xxxxxxx</TAG>  --- Not a match

Sample Matches

Note how capture group 1 is only populated by the <tag2>...</tag2> that was not enclosed inside <tag>...</tag>.

[0][0] = <tag2>is a WORD</tag2>
[0][1] = <tag2>is a WORD</tag2>

[1][0] = <TAG><TAG2>xxx</TAG2></TAG> --- Not a match
[1][1] = 

[2][0] = <TAG>xxxxxxx<TAG2>yyyy</TAG2>xxxxxxx</TAG>  --- Not a match
[2][1] = 
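
The matches above can be reproduced in PHP with a short sketch along these lines. It assumes the `i` (case-insensitive) modifier, since the sample text uses upper-case tags while the pattern is lower-case; `unenclosedTag2` is a hypothetical helper name, not part of any library. With `PREG_SET_ORDER`, PHP may omit a trailing group that did not take part in the match, so the code checks `isset` as well as emptiness.

```php
<?php
// Pattern from the description; "~" delimiters avoid escaping "/".
// The "i" modifier is an assumption made to match the upper-case sample tags.
$pattern = '~(<tag2>.*?</tag2>)|<tag>(?:(?!<tag\s?>).)*~i';

$text = <<<'TXT'
This <tag2>is a WORD</tag2> --- Match
<TAG><TAG2>xxx</TAG2></TAG> --- Not a match
<TAG>xxxxxxx<TAG2>yyyy</TAG2>xxxxxxx</TAG>  --- Not a match
TXT;

// Hypothetical helper: keep only the <tag2> blocks captured in group 1.
function unenclosedTag2(string $pattern, string $text): array
{
    $found = [];
    preg_match_all($pattern, $text, $matches, PREG_SET_ORDER);
    foreach ($matches as $m) {
        // Group 1 is missing or empty when the <tag>... alternative matched.
        if (isset($m[1]) && $m[1] !== '') {
            $found[] = $m[1];
        }
    }
    return $found;
}

print_r(unenclosedTag2($pattern, $text));
```

For the sample text, only the first line contributes a value; the two `<TAG>` lines are matched by the second alternative and skipped because group 1 is not populated.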

Explanation

NODE                     EXPLANATION
----------------------------------------------------------------------
  (                        group and capture to \1:
----------------------------------------------------------------------
    <tag2>                   '<tag2>'
----------------------------------------------------------------------
    .*?                      any character except \n (0 or more times
                             (matching the least amount possible))
----------------------------------------------------------------------
    </tag2>                  '</tag2>'
----------------------------------------------------------------------
  )                        end of \1
----------------------------------------------------------------------
 |                        OR
----------------------------------------------------------------------
  <tag>                    '<tag>'
----------------------------------------------------------------------
  (?:                      group, but do not capture (0 or more times
                           (matching the most amount possible)):
----------------------------------------------------------------------
    (?!                      look ahead to see if there is not:
----------------------------------------------------------------------
      <tag                     '<tag'
----------------------------------------------------------------------
      \s?                      whitespace (\n, \r, \t, \f, and " ")
                               (optional (matching the most amount
                               possible))
----------------------------------------------------------------------
      >                        '>'
----------------------------------------------------------------------
    )                        end of look-ahead
----------------------------------------------------------------------
    .                        any character except \n
----------------------------------------------------------------------
  )*                       end of grouping
----------------------------------------------------------------------
Ro Yo Mi
  • I just found another interesting answer and I'd like to link it for other people: https://stackoverflow.com/questions/23589174/regex-pattern-to-match-excluding-when-except-between – Revious Jul 22 '19 at 21:30