2

How can I use regex to retrieve all html tag names within an html snippet? I'm using PHP to do this if it matters. For example:

<div id="someid">
     <img src="someurl" />
     <br />
     <p>some content</p>
</div>

should return: div, img, br, p.

VinnyD

4 Answers

3

This should work for most well-formed markup, provided you aren't in a CDATA section and haven't played nasty games redefining entities:

# nasty, ugly, illegible, unmaintainable: NEVER USE THIS STYLE!!!!
/<\w+(?:\s+\w+=(?:\S+|(['"])(?:(?!\1).)*?\1))*\s*\/?>/s

or more legibly, as

# broken out into related elements grouped by whitespace via /x
/ < \w+ (?: \s+ \w+ = (?: \S+ | (['"]) (?: (?! \1) . ) *? \1 )) * \s* \/? > /xs

and even more legibly as this:

/ 
   # start of tag, with named ident
   < \w+ 
   # now with unlimited k=v pairs 
   #    where k is \w+ 
   #      and v is either \S+ or else quoted 
   (?: \s+ \w+ = (?: \S+        # either an unquoted value, 
                   | ( ['"] )   # or else first pick either quote
                     (?: 
                        (?! \1) .  # anything that isn't our quote, including brackets
                     ) * ?     # maximal should probably work here
                     \1        # till we see it again
                 ) 
   )  *    # as many k=v pairs as we can find
   \s *    # tolerate closing whitespace

   \/ ?    # XHTML style close tag
   >       # finally done
/xs
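
As a rough illustration of how the commented pattern above could be applied to the question's snippet, here is a minimal sketch in Python (the language another answer on this page uses), where `re.VERBOSE` and `re.DOTALL` play the role of `/x` and `/s`. The named groups and the helper `tag_names` are assumptions added for illustration; they are not part of the pattern as given, which matches whole tags but does not capture the name.

```python
import re

# Same shape as the commented pattern above. Assumption: a named group
# is wrapped around the tag name so it can be extracted afterwards.
TAG_RX = re.compile(r"""
    < (?P<name> \w+ )                # start of tag, tag name captured
    (?: \s+ \w+ =                    # attribute name followed by an equals sign
        (?: \S+                      # either an unquoted value,
          | (?P<q> ['"] )            # or else an opening quote,
            (?: (?! (?P=q) ) . )*?   # anything that is not that quote,
            (?P=q)                   # until the same quote shows up again
        )
    )*                               # as many k=v pairs as we can find
    \s*                              # tolerate closing whitespace
    /?                               # XHTML-style close
    >
""", re.VERBOSE | re.DOTALL)


def tag_names(html):
    """Hypothetical helper: opening-tag names in document order."""
    return [m.group("name") for m in TAG_RX.finditer(html)]


snippet = """<div id="someid">
     <img src="someurl" />
     <br />
     <p>some content</p>
</div>"""

print(tag_names(snippet))
```

Closing tags such as `</div>` are skipped because `\w+` cannot match the slash that follows `<`.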

There is a bit of slop you could add there, like tolerating whitespace in a few places where I don’t above.

PHP isn’t necessarily the best language for this sort of work, although you can make do in a pinch. At the very least, you should hide this stuff in a function and/or variable somewhere, not leave it exposed all naked-like; remember that The Children Are Watching™.

To do anything more complicated than finding, oh, I dunno, letters or whitespace, patterns benefit greatly from comments and whitespace. That should go without saying, but for some reason people forget to use /x for cognitive chunking, letting whitespace group related things just as you do with imperative code.

Even though they are declarative programs rather than imperative ones, patterns benefit even more from full problem decomposition and top-down design. One way to realize this is to have "regex subroutines" that you declare separately from where you use them. Otherwise you’re just doing cut-and-paste code reuse, which is code reuse of the pessimal sort. Here is an example pattern for matching an <img> tag, this time using real Perl:

my $img_rx = qr{

    # save capture in $+{TAG} variable
    (?<TAG> (?&image_tag) )

    # remainder is pure declaration
    (?(DEFINE)

        (?<image_tag>
            (?&start_tag)
            (?&might_white) 
            (?&attributes) 
            (?&might_white) 
            (?&end_tag)
        )

        (?<attributes>
            (?: 
                (?&might_white) 
                (?&one_attribute) 
            ) *
        )

        (?<one_attribute>
            \b
            (?&legal_attribute)
            (?&might_white) = (?&might_white) 
            (?:
                (?&quoted_value)
              | (?&unquoted_value)
            )
        )

        (?<legal_attribute> 
            (?: (?&required_attribute)
              | (?&optional_attribute)
              | (?&standard_attribute)
              | (?&event_attribute)
              # for LEGAL parse only, comment out next line 
              | (?&illegal_attribute)
            )
        )

        (?<illegal_attribute> \b \w+ \b )

        (?<required_attribute>
            alt
          | src
        )

        (?<optional_attribute>
            (?&permitted_attribute)
          | (?&deprecated_attribute)
        )

        # NB: The white space in string literals 
        #     below DOES NOT COUNT!   It's just 
        #     there for legibility.

        (?<permitted_attribute>
            height
          | is map
          | long desc
          | use map
          | width
        )

        (?<deprecated_attribute>
             align
           | border
           | hspace
           | vspace
        )

        (?<standard_attribute>
            class
          | dir
          | id
          | style
          | title
          | xml:lang
        )

        (?<event_attribute>
            on abort
          | on click
          | on dbl click
          | on mouse down
          | on mouse out
          | on key down
          | on key press
          | on key up
        )

        (?<unquoted_value> 
            (?&unwhite_chunk) 
        )

        (?<quoted_value>
            (?<quote>   ["']      )
            (?: (?! \k<quote> ) . ) *
            \k<quote> 
        )

        (?<unwhite_chunk>   
            (?:
                # (?! [<>'"] ) 
                (?! > ) 
                \S
            ) +   
        )

        (?<might_white>     \s *   )

        (?<start_tag>  
            < (?&might_white) 
            img 
            \b       
        )

        (?<end_tag>          
            (?&html_end_tag)
          | (?&xhtml_end_tag)
        )

        (?<html_end_tag>       >  )
        (?<xhtml_end_tag>    / >  )

    )

}six;

Yup, it gets long, but by getting longer it becomes more maintainable, not less. It is also more correct. Now, the real program that it is used in does more than just that, because you have to account for quite a bit more than that in real HTML, such as CDATA and encodings and naughty redefinitions of entities. However, contrary to popular belief, you can actually do that sort of thing with PHP, because it uses PCRE, which allows for (?(DEFINE)...) blocks and recursive patterns. I have more seriousish examples of this sort of thing in my answers here, here, here, here, and here.

Ok, good, did you read all those, or at least glance at them? Still with me? Hello?? Don’t forget to breathe. There there, you’ll be ok now. :)

Certainly there is a large grey area where the possible gives way to the inadvisable, and far more quickly than it yields to the impossible. If those examples in those answers, let alone these in this current one, are beyond your own current skill level with pattern matching, then you probably should use something else, which often means getting someone else to do it for you.

tchrist
  • Not encountered `(?(DEFINE)...)` before. Do you know if it is Perl and PCRE only, or are there any other implementations that support it? (I'm not getting anything useful out of Google.) – Peter Boughton Aug 24 '11 at 21:54
  • @Peter: Yeah, these are the sorts of things that are impossible to Google up because of the way non-alphanumerics are discarded and case is ignored. I wasn’t paying attention when it happened, but my hunch is that Perl got it from PCRE, not the other way around. I don’t know what else if anything supports it. There are plenty of issues with it — if you look at my programs, I am forced to duplicate stuff for common subroutine definitions due to a lack of namespace control/patrol — but it is still pretty darned cool. – tchrist Aug 24 '11 at 22:34
1

I guess this should work ... I'll try it in a minute:

edit: removed \s+ (thanks to Peteris)

preg_match_all('/<(\w+)[^>]*>/', $html, $matched_elements);
// $matched_elements[1] holds the captured tag names
Teneff
  • 1
    It will not work on `

    `. A fix is `'/|\s+[^>]*>)/'`.

    – Peteris Aug 24 '11 at 20:16
  • 2
    It will not work correctly on ``. – CanSpice Aug 24 '11 at 20:19
  • @CanSpice: So what? Don't make me show you how to do it! Plus, do we know anything but the data? No. It is perfectly possible that those are not in the data, which may not be open-ended at all. – tchrist Aug 24 '11 at 20:34
  • 3
    @tchrist: So he should use an HTML parser to parse HTML. He should use the right tool for the job. – CanSpice Aug 24 '11 at 20:49
  • @CanSpice: I am not ready to say that. I use search and replace in `vi` when editing HTML. If that is permitted, then certainly you should be allowed to use pattern matching on HTML. If you are not allowed to, then you should not be allowed to use `vi` on those files. HTML is text — complicated text, I grant you, but still just text. There is nothing whatsoever wrong with writing `://, //s/
    //` in `vi`, and therefore there is nothing wrong with writing the equivalent in your programming language of choice. Quit shoving newbies at nontext solutions.
    – tchrist Aug 24 '11 at 21:29
1

Regexes might not always work. If you're 100% sure that it's well formed XHTML, regexes could be a way to do it though. If not, use some sort of PHP library to do it. In C#, there is something called the HTML Agility Pack, http://htmlagilitypack.codeplex.com, e.g. see How do I parse HTML using regular expressions in C#?. Maybe there is an equivalent tool in PHP.
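
To make the parser route concrete, here is a minimal sketch using `html.parser.HTMLParser` from Python's standard library. It is not the PHP library this answer wonders about, only an illustration of the same idea in the language another answer on this page uses. The class `TagNameCollector` and the function `tag_names` are hypothetical names chosen for the example.

```python
from html.parser import HTMLParser


class TagNameCollector(HTMLParser):
    """Hypothetical helper: records every start tag's name in document order."""

    def __init__(self):
        super().__init__()
        self.names = []

    def handle_starttag(self, tag, attrs):
        # Also reached for self-closing tags such as <br />, because the
        # default handle_startendtag calls handle_starttag.
        self.names.append(tag)


def tag_names(html):
    collector = TagNameCollector()
    collector.feed(html)
    collector.close()
    return collector.names


snippet = """<div id="someid">
     <img src="someurl" />
     <br />
     <p>some content</p>
</div>"""

print(tag_names(snippet))
```

Unlike a bare pattern such as `<(\w+)`, the parser does not report tags that appear only inside a comment, so `<!-- <b> --><i>x</i>` yields just `i`.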

nickytonline
  • I use things like `/color="#000000"` and `:g/ – tchrist Aug 24 '11 at 21:38
  • tchrist, the truth is that the one-off searches you've mentioned above might not be a problem if you have well-formed XHTML. But for a more robust solution, I decided to go with PHP's DOMDocument class as the parser. – VinnyD Sep 01 '11 at 17:10
0

In Python, one solution is something like this, which uses a regex to get all distinct tag names in the HTML.

import re

s = """<div id="someid">
       <img src="someurl" />
       <br />
       <p>some content</p>
       </div>
    """

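# Raw strings keep \w from being read as a string escape.
print(set(re.findall(r'<(\w+)', s)))
# {'p', 'img', 'div', 'br'}  (set order may vary)
# or
print({i.replace('<', '') for i in re.findall(r'(<\w+)', s)})
# {'p', 'img', 'div', 'br'}  (set order may vary)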
Ajeet Verma