
I'm playing with Python and I would like to solve the following problem with a regex:

I would like to parse the HTML of a website with a regex. I get the page as a string and take every line of it in a loop:

import re

for line in html.splitlines():
    # print line
    matchObj = re.match(r'<h1(.*)>', line, re.M | re.I)
    if matchObj:
        print matchObj.group()

I would like to match every line that looks like this:
<h1 class="hidden offscreen" tabindex="0"> anyContent </h1>

John Smithv1

3 Answers


A naive version would be

import re

html = '<h1 class="hidden offscreen" tabindex="0"> anyContent </h1>'
print re.search('(?is)<h1[^>]*>(.+?)</h1>', html).group(1)

Note that this assumes valid HTML; if that might not be the case, it's safer to use a parser:

from BeautifulSoup import BeautifulSoup
print BeautifulSoup(html).find("h1").text
georg

If you ONLY want to parse such content, you can do it with a regex along these lines:

<h1 class="hidden offscreen" tabindex="0">(?P<content>.*?)</h1>

Do NOT try to expand this to other tags or cases. HTML uses a more complex grammar than regex can handle.
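
For illustration, here is a minimal sketch of that pattern in use (it assumes the attributes appear in exactly the order shown and that the whole element sits on one line):

import re

# pattern from above, with a named group for the element's text
PATTERN = r'<h1 class="hidden offscreen" tabindex="0">(?P<content>.*?)</h1>'

line = '<h1 class="hidden offscreen" tabindex="0"> anyContent </h1>'
match = re.search(PATTERN, line)
if match:
    # the named group gives direct access to the text between the tags
    print match.group('content')   # -> " anyContent "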

I agree with Hai Vu's comment: use the HTMLParser module. Or Beautiful Soup: http://www.crummy.com/software/BeautifulSoup/

dutt

Here is an old script I wrote a while back. It should get you started. Pay attention to the attrs argument of the handle_starttag() method; a sketch that filters on it follows the script.

import HTMLParser

class HeadersParser(HTMLParser.HTMLParser, object):
    def __init__(self):
        super(HeadersParser, self).__init__()
        self.in_header = False

    def handle_starttag(self, tag, attrs):
        # attrs is a list of (name, value) pairs for the tag's attributes
        if tag.lower() == 'h1':
            self.in_header = True

    def handle_endtag(self, tag):
        if tag.lower() == 'h1':
            self.in_header = False

    def handle_data(self, data):
        # only print text that appears inside an <h1>...</h1> element
        if self.in_header:
            print '{}'.format(data)

with open('sample.html') as f:
    html_contents = f.read()

parser = HeadersParser()
parser.feed(html_contents)
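
If you only want the <h1 class="hidden offscreen" tabindex="0"> element from the question, a minimal sketch of filtering on attrs could look like this (the class check mirrors the question's example and is an assumption about the real page):

import HTMLParser

class HiddenHeaderParser(HTMLParser.HTMLParser, object):
    def __init__(self):
        super(HiddenHeaderParser, self).__init__()
        self.in_header = False

    def handle_starttag(self, tag, attrs):
        # attrs arrives as a list of (name, value) pairs,
        # e.g. [('class', 'hidden offscreen'), ('tabindex', '0')]
        attrs = dict(attrs)
        if tag.lower() == 'h1' and attrs.get('class') == 'hidden offscreen':
            self.in_header = True

    def handle_endtag(self, tag):
        if tag.lower() == 'h1':
            self.in_header = False

    def handle_data(self, data):
        if self.in_header:
            print data

parser = HiddenHeaderParser()
parser.feed('<h1 class="hidden offscreen" tabindex="0"> anyContent </h1>')
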
Hai Vu