This should work for most well-formed markup, provided you aren't in a CDATA section and haven't played nasty games redefining entities:
# nasty, ugly, illegible, unmaintainable: NEVER USE THIS STYLE!!!!
/<\w+(?:\s+\w+=(?:(['"])(?:(?!\1).)*?\1|[^\s>]+))*\s*\/?>/s
or more legibly, as
# broken out into related elements grouped by whitespace via /x
/ < \w+ (?: \s+ \w+ = (?: (['"]) (?: (?! \1) . ) *? \1 | [^\s>]+ )) * \s* \/? > /xs
and even more legibly as this:
/
# start of tag, with named ident
< \w+
# now with unlimited k=v pairs
# where k is \w+
# and v is either quoted or else a run of non-space, non-> characters
(?: \s+ \w+ = (?: ( ['"] ) # either first pick either quote
(?:
(?! \1) . # anything that isn't our quote, including brackets
) * ? # maximal should probably work here
\1 # till we see it again
| [^\s>]+ # or else an unquoted value, which must stop short of >
)
) * # as many k=v pairs as we can find
\s * # tolerate closing whitespace
\/ ? # XHTML style close tag
> # finally done
/xs
There is a bit of slop you could add there, like tolerating whitespace in a few places where I don’t above.
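For instance, here is a rough sketch of that sort of slop in Perl, letting whitespace sit on either side of the = sign. The name $loose_tag_rx and the sample strings are made up for illustration. The sketch tries the quoted alternative before the unquoted one and keeps unquoted values from running past a >, so that a quoted value like "a > b" is taken whole.

```perl
use strict;
use warnings;

# $loose_tag_rx is a made-up name for this sketch.
my $loose_tag_rx = qr{
    < \w+
    (?: \s+ \w+
        \s* = \s*                          # the added slop: "k = v" is fine too
        (?: (['"]) (?: (?! \1) . )*? \1    # quoted value, tried first
          | [^\s>]+                        # unquoted value, stops before >
        )
    )*
    \s* \/? >
}xs;

# Collect the full text of each match for inspection.
my @matched;
for my $sample ('<a href = "x.html">', '<img src=pic.png alt = "a > b" />') {
    push @matched, $& if $sample =~ $loose_tag_rx;
}
print "$_\n" for @matched;
```

Both samples should come back whole, including the one whose quoted value contains a bracket.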
PHP isn’t necessarily the best language for this sort of work, although you can make do in a pinch. At the very least, you should hide this stuff in a function and/or variable somewhere, not leave it exposed all naked-like. Consider that The Children Are Watching™.
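Here is a minimal sketch of what that hiding might look like in Perl, assuming all you want back is the list of things that look like tags. The names $TAG_RX and find_tags are hypothetical, not from any library, and the pattern is a compact restatement of the k=v shape above with the quoted alternative tried first and unquoted values stopping before a >.

```perl
use strict;
use warnings;

# Compile once, name it, and keep it out of sight.
# $TAG_RX and find_tags() are hypothetical names.
my $TAG_RX = qr{
    < \w+
    (?: \s+ \w+ =
        (?: (['"]) (?: (?! \1) . )*? \1    # quoted value
          | [^\s>]+                        # unquoted value
        )
    )*
    \s* \/? >
}xs;

# Return every substring of the input that the pattern accepts as a tag.
sub find_tags {
    my ($html) = @_;
    my @tags;
    # Collect the whole match each time; a list-context //g match
    # would hand back only the captured quote character (or undef).
    while ($html =~ /$TAG_RX/g) {
        push @tags, $&;
    }
    return @tags;
}

my @tags = find_tags(q{<p class="x">Hi <br/> <a href='u.html' id=top>there</a></p>});
print "$_\n" for @tags;
```

Callers then see only find_tags(), and the pattern can grow comments, slop, or fixes in exactly one place. Note that closing tags such as </a> are not matched, since the pattern insists on \w+ right after the <.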
For anything more complicated than finding, oh I dunno, letters or whitespace, patterns benefit greatly from comments and whitespace. That should go without saying, but for some reason people forget to use /x for cognitive chunking, letting whitespace group related things just as you do with imperative code.
Even though they are declarative programs, not imperative ones, patterns benefit even more from full problem decomposition and top-down design. One way to realize this is to have "regex subroutines" that you declare separately from where you use them. Otherwise you’re just doing cut&paste code reuse, which is code reuse of the pessimal sort. Here is an example pattern for matching an <img> tag, this time using real Perl:
my $img_rx = qr{
# save capture in $+{TAG} variable
(?<TAG> (?&image_tag) )
# remainder is pure declaration
(?(DEFINE)
(?<image_tag>
(?&start_tag)
(?&might_white)
(?&attributes)
(?&might_white)
(?&end_tag)
)
(?<attributes>
(?:
(?&might_white)
(?&one_attribute)
) *
)
(?<one_attribute>
\b
(?&legal_attribute)
(?&might_white) = (?&might_white)
(?:
(?&quoted_value)
| (?&unquoted_value)
)
)
(?<legal_attribute>
(?: (?&required_attribute)
| (?&optional_attribute)
| (?&standard_attribute)
| (?&event_attribute)
# for LEGAL parse only, comment out next line
| (?&illegal_attribute)
)
)
(?<illegal_attribute> \b \w+ \b )
(?<required_attribute>
alt
| src
)
(?<optional_attribute>
(?&permitted_attribute)
| (?&deprecated_attribute)
)
# NB: The white space in string literals
# below DOES NOT COUNT! It's just
# there for legibility.
(?<permitted_attribute>
height
| is map
| long desc
| use map
| width
)
(?<deprecated_attribute>
align
| border
| hspace
| vspace
)
(?<standard_attribute>
class
| dir
| id
| style
| title
| xml:lang
)
(?<event_attribute>
on abort
| on click
| on dbl click
| on mouse down
| on mouse out
| on key down
| on key press
| on key up
)
(?<unquoted_value>
(?&unwhite_chunk)
)
(?<quoted_value>
(?<quote> ["'] )
(?: (?! \k<quote> ) . ) *
\k<quote>
)
(?<unwhite_chunk>
(?:
# (?! [<>'"] )
(?! > )
\S
) +
)
(?<might_white> \s * )
(?<start_tag>
< (?&might_white)
img
\b
)
(?<end_tag>
(?&html_end_tag)
| (?&xhtml_end_tag)
)
(?<html_end_tag> > )
(?<xhtml_end_tag> / > )
)
}six;
Yup, it gets long, but by getting longer it becomes more maintainable, not less. It is also more correct. Now, the real program that it is used in does more than just that, because you have to account for quite a bit more than that in real HTML, such as CDATA and encodings and naughty redefinitions of entities. However, contrary to popular belief, you can actually do that sort of thing with PHP, because it uses PCRE, which allows for (?(DEFINE)...) blocks and recursive patterns. I have more seriousish examples of this sort of thing in five of my other answers.
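If you want a quick feel for how (?(DEFINE)...) subroutines behave before wading into those, here is a deliberately tiny sketch that pulls attribute pairs out of a single tag. The subpattern names (name, value, quoted, bare), the variable names, and the sample tag are all made up for illustration, and it assumes Perl 5.10 or later.

```perl
use strict;
use warnings;
use 5.010;    # named captures and (?(DEFINE)...) need Perl 5.10 or later

# A tiny grammar: the first line is the only part that matches anything,
# and everything inside DEFINE is pure declaration. All names are made up.
my $attr_rx = qr{
    (?<ATTR> (?&name) \s* = \s* (?&value) )

    (?(DEFINE)
        (?<name>     [A-Za-z_:] [\w:.-]* )
        (?<value>    (?&quoted) | (?&bare) )
        (?<quoted>   (?<q> ["'] ) (?: (?! \k<q> ) . )* \k<q> )
        (?<bare>     [^\s>"']+ )
    )
}xs;

my $tag = q{<img src="cat.png" alt='A "fine" cat' width=40>};

# Each successful match leaves the whole pair in the named capture ATTR.
my @attrs;
while ($tag =~ /$attr_rx/g) {
    push @attrs, $+{ATTR};
}
print "$_\n" for @attrs;
```

The point of the layout is that each named piece is written once and called by name wherever it is needed, so changing what counts as a bare value means editing one line.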
Ok, good, did you read all those, or at least glance at them? Still with me? Hello?? Don’t forget to breathe. There there, you’ll be ok now. :)
Certainly there is a large grey area where the possible gives way to the inadvisable, and far more quickly than it yields to the impossible. If those examples in those answers, let alone these in this current one, are beyond your own current skill level with pattern matching, then you probably should use something else, which often means getting someone else to do it for you.