Python Regex - Parsing HTML

Question

I have this short piece of code, and it raises AttributeError: 'NoneType' object has no attribute 'group'.

import sys
import re

#def extract_names(filename):

f = open('name.html', 'r')
text = f.read()

match = re.search (r'<hgroup><h1>(\w+)</h1>', text)
second = re.search (r'<li class="hover">Employees: <b>(\d+,\d+)</b></li>', text)  

outf = open('details.txt', 'a')
outf.write(match)
outf.close()

My intention is to read a .HTML file looking for the <h1> tag value and the number of employees and append them to a file. But for some reason I can't seem to get it right. Your help is greatly appreciated.

@larsmans: the myriad others also include [this one](http://stackoverflow.com/questions/4231382/regular-expression-pattern-not-matching-anywhere-in-string/4234491#4234491) which actually demonstrates how it is possible to parse HTML with regexes. And Helen's task right here is teeny-tiny compared to that. So not so trigger-happy. — ЯegDwight, Sep 20 '12 at 15:28
It’s a shame you can’t use `vi` to edit HTML files, innit? — tchrist, Sep 20 '12 at 15:54
I think that a higher-level library like Scrapy or Beautiful Soup would fit your task better than regular expressions. — mariosangiorgio, Sep 20 '12 at 13:15
If you show the relevant portions of the HTML file you are looking at, it would help a great deal. — tchrist, Sep 21 '12 at 18:02

Martijn Pieters · Answer 1 · 2012-09-20T13:22:23.020

6

You are using a regular expression, but matching HTML with such expressions gets too complicated, too fast. Don't do that.

Use an HTML parser instead; Python has several to choose from:

ElementTree is part of the standard library
BeautifulSoup is a popular third-party library
lxml is a fast and feature-rich C-based library

The latter two handle malformed HTML quite gracefully as well, making decent sense of many a botched website.

ElementTree example:

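from xml.etree import ElementTree

# ElementTree only parses well-formed XML (for example XHTML).
tree = ElementTree.parse('filename.html')
# iter() walks the whole tree; findall('h1') would only match direct children of the root.
for elem in tree.iter('h1'):
    print(ElementTree.tostring(elem, encoding='unicode'))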
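
If the file is not well-formed XML, ElementTree will refuse to parse it. As a rough sketch of the parser-based approach that needs no third-party install, the standard library's `html.parser` can be used instead of BeautifulSoup or lxml. The class name `H1Collector`, the helper `extract_h1`, and the sample markup are made up for illustration; the real `name.html` was not shown.

```python
from html.parser import HTMLParser


class H1Collector(HTMLParser):
    """Collect the text content of every <h1> element (illustrative helper)."""

    def __init__(self):
        super().__init__()
        self._in_h1 = False
        self._parts = []
        self.headings = []

    def handle_starttag(self, tag, attrs):
        # Start buffering text when an <h1> opens.
        if tag == "h1":
            self._in_h1 = True
            self._parts = []

    def handle_data(self, data):
        # Only keep text that appears inside an <h1>.
        if self._in_h1:
            self._parts.append(data)

    def handle_endtag(self, tag):
        # Store the buffered text when the <h1> closes.
        if tag == "h1" and self._in_h1:
            self.headings.append("".join(self._parts).strip())
            self._in_h1 = False


def extract_h1(html_text):
    """Return a list with the text of each <h1> found in html_text."""
    parser = H1Collector()
    parser.feed(html_text)
    parser.close()
    return parser.headings


# Hypothetical sample; the unclosed <li> would make this invalid XML.
sample = '<hgroup><h1>Acme</h1></hgroup><ul><li class="hover">Employees: <b>1,234</b></ul>'
print(extract_h1(sample))
```

The same idea carries over to BeautifulSoup or lxml, which offer a more convenient tree API for anything beyond pulling out a single tag.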

edited Sep 20 '12 at 13:22

answered Sep 20 '12 at 13:15

Martijn Pieters


2

Better use BeautifulSoup or `lxml.html` for HTML files, though -- more often than not, they're malformed XML. – Fred Foo Sep 20 '12 at 13:16

score 1 · Accepted Answer · answered Sep 20 '12 at 15:35

Just for the sake of completeness: your error message simply indicates that your regular expression did not match. In that case re.search returns None, and None has no group attribute.
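
A minimal sketch of guarding against that case, using the two patterns from the question. The helper name `first_group` and the sample text are assumptions for illustration, since the actual contents of name.html were not posted.

```python
import re

# Hypothetical sample text standing in for the contents of name.html.
text = '<hgroup><h1>Acme</h1></hgroup><li class="hover">Employees: <b>1,234</b></li>'


def first_group(pattern, text):
    """Return group 1 of the first match, or None if the pattern does not match."""
    match = re.search(pattern, text)
    # re.search returns None on failure, so check before calling .group().
    return match.group(1) if match is not None else None


name = first_group(r'<hgroup><h1>(\w+)</h1>', text)
employees = first_group(r'<li class="hover">Employees: <b>(\d+,\d+)</b></li>', text)
missing = first_group(r'<h2>(\w+)</h2>', text)  # no <h2> in the sample

print(name, employees, missing)
```

Note also that a file's write() expects a string, so it is the captured group (for example `name`) that should be written to details.txt, not the match object itself.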

answered Sep 20 '12 at 15:35

Pierre GM

