
I'm playing with Python and I would like to solve the following problem with a regex:

I would like to parse the HTML of a website with a regex. I get the page as a string and take every line of it in a loop:

import re

for line in html.splitlines():
    # print line
    matchObj = re.match(r'<h1(.*)>', line, re.M | re.I)
    if matchObj:
        print matchObj.group()

I would like to match every line that looks like this:
<h1 class="hidden offscreen" tabindex="0"> anyContent </h1>

John Smithv1

3 Answers


A naive version would be

import re

html = '<h1 class="hidden offscreen" tabindex="0"> anyContent </h1>'
print re.search('(?is)<h1[^>]*>(.+?)</h1>', html).group(1)

Note that this assumes valid HTML; if that might not be the case, it's safer to use a parser:

from BeautifulSoup import BeautifulSoup
print BeautifulSoup(html).find("h1").text
georg

If you ONLY want to parse such content, you can do it with a regex along these lines:

<h1 class="hidden offscreen" tabindex="0">(?P<content>.*?)</h1>

Do NOT try to expand this to other tags or cases. HTML uses a more complex grammar than regex can handle.
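
For illustration, here is a minimal sketch of that pattern in use (it assumes the attributes appear in exactly the order shown and that the whole element sits on one line):

import re

# pattern from above, with a named group for the element's text
PATTERN = r'<h1 class="hidden offscreen" tabindex="0">(?P<content>.*?)</h1>'

line = '<h1 class="hidden offscreen" tabindex="0"> anyContent </h1>'
match = re.search(PATTERN, line)
if match:
    # the named group gives direct access to the text between the tags
    print match.group('content')   # -> " anyContent "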

I agree with Hai Vu's comment: use the HTMLParser module. Or Beautiful Soup: http://www.crummy.com/software/BeautifulSoup/

dutt

Here is an old script I wrote a while back. It should get you started. Pay attention to the attrs argument of the handle_starttag() method; a sketch that filters on it follows the script.

import HTMLParser

class HeadersParser(HTMLParser.HTMLParser, object):
    def __init__(self):
        super(HeadersParser, self).__init__()
        self.in_header = False

    def handle_starttag(self, tag, attrs):
        # attrs is a list of (name, value) pairs for the tag's attributes
        if tag.lower() == 'h1':
            self.in_header = True

    def handle_endtag(self, tag):
        if tag.lower() == 'h1':
            self.in_header = False

    def handle_data(self, data):
        # only print text that appears inside an <h1>...</h1> element
        if self.in_header:
            print '{}'.format(data)

with open('sample.html') as f:
    html_contents = f.read()

parser = HeadersParser()
parser.feed(html_contents)
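
If you only want the <h1 class="hidden offscreen" tabindex="0"> element from the question, a minimal sketch of filtering on attrs could look like this (the class check mirrors the question's example and is an assumption about the real page):

import HTMLParser

class HiddenHeaderParser(HTMLParser.HTMLParser, object):
    def __init__(self):
        super(HiddenHeaderParser, self).__init__()
        self.in_header = False

    def handle_starttag(self, tag, attrs):
        # attrs arrives as a list of (name, value) pairs,
        # e.g. [('class', 'hidden offscreen'), ('tabindex', '0')]
        attrs = dict(attrs)
        if tag.lower() == 'h1' and attrs.get('class') == 'hidden offscreen':
            self.in_header = True

    def handle_endtag(self, tag):
        if tag.lower() == 'h1':
            self.in_header = False

    def handle_data(self, data):
        if self.in_header:
            print data

parser = HiddenHeaderParser()
parser.feed('<h1 class="hidden offscreen" tabindex="0"> anyContent </h1>')
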
Hai Vu