Easiest way to parse specific pieces of information from HTML

Question

I know the question title isn't amazing, but I can't think of a better way to word it. I have a bit of HTML that I need to search:

<tr bgcolor="#e2d8d4">
<td>1</td>
<td>12:00AM</td>
<td>Show Name<a name="ID#"></a></td>
<td>Winter 12</td>
<td>Channel</td>
<td>Production Company</td>
<td nowrap>1d 11h 9m (air time)</td>
<td align="center">11</td>
<td>
<a href="link">AniDB</a></td>
<td><a href="link">Home</a></td>
</tr>

The page is several dozen of these html blocks. I need to be able to, with just Show Name, pick out the air time of a given show, as well as the bgcolor. (full page here: http://www.mahou.org/Showtime/Planner/). I am assuming the best bet would be a regexp, but I am not confident in that assumption. I would prefer not to use 3rd party modules (BeautifulSoup). I apologize in advance if the question is vague.

Don't use regexp to parse html. BeautifulSoup is actually what you need. — EwyynTomato, Mar 16 '12 at 04:17
At least use [HTMLParser](http://docs.python.org/library/htmlparser.html), but I prefer `lxml` or `beautifulsoup`. [Using regex to parse HTML is bad](http://stackoverflow.com/questions/1732348/regex-match-open-tags-except-xhtml-self-contained-tags) — Kien Truong, Mar 16 '12 at 04:17

score 1 · Accepted Answer · edited May 23 '17 at 12:04

Thank you for doing your research - it's good that you are aware of BeautifulSoup. This would really be the best way to go about solving your problem.

That aside... here is a generic strategy you can choose to implement using regexes (if your sanity is questionable) or using BeautifulSoup (if you're sane.)

It looks like the data you want is always in a table that starts off like:

<table summary="Showtime series for Sunday in a Planner format." border="0" bgcolor="#bfa89b" cellpadding="0" cellspacing="0" width="100%">

You can isolate this by looking for the summary="Showtime series for (Monday|Tuesday|....|Sunday)" attribute of the table, which is unique in the page.
Once you have isolated that table, the format of the rows within the table is well defined. I would get one <tr> at a time and assume that the second <td> will always contain the airing time, and the third <td> will always contain the show's name.
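
If you would rather stay with the standard library, the same row-by-row strategy can be sketched with html.parser (the Python 3 name of the HTMLParser module mentioned in the comments). Treat this as a rough sketch rather than a tested scraper: ShowtimeParser and find_show are names made up for this example, and it assumes every row is closed with </tr>, that the tables are not nested, and that the second and third <td> hold the airing time and the show name as described above.

```python
from html.parser import HTMLParser


class ShowtimeParser(HTMLParser):
    """Collect the bgcolor and cell texts of each row in the Showtime tables."""

    def __init__(self):
        super().__init__()
        self.in_table = False  # inside a table whose summary matches
        self.in_cell = False   # inside a <td>
        self.row = None        # the <tr> currently being read, if any
        self.rows = []         # finished rows

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if tag == "table":
            # Assumption: only the schedule tables carry this summary text.
            summary = attrs.get("summary") or ""
            self.in_table = summary.startswith("Showtime series for")
        elif self.in_table and tag == "tr":
            self.row = {"bgcolor": attrs.get("bgcolor"), "cells": []}
        elif self.row is not None and tag == "td":
            self.in_cell = True
            self.row["cells"].append("")

    def handle_data(self, data):
        # Text may arrive in several chunks, so append to the current cell.
        if self.in_cell and self.row is not None:
            self.row["cells"][-1] += data

    def handle_endtag(self, tag):
        if tag == "td":
            self.in_cell = False
        elif tag == "tr" and self.row is not None:
            self.rows.append(self.row)
            self.row = None
            self.in_cell = False
        elif tag == "table":
            self.in_table = False


def find_show(html, show_name):
    """Return the bgcolor and airing time for show_name, or None if absent."""
    parser = ShowtimeParser()
    parser.feed(html)
    for row in parser.rows:
        cells = [cell.strip() for cell in row["cells"]]
        # Second <td> = airing time, third <td> = show name (see above).
        if len(cells) >= 3 and cells[2] == show_name:
            return {"bgcolor": row["bgcolor"], "time": cells[1]}
    return None
```

Calling find_show(page_source, "Show Name") on the downloaded page should then give back a small dict for the matching row. The other columns can be read by index from the same cells list. With BeautifulSoup the equivalent code is shorter and copes better with sloppy markup, which is why it remains the recommendation.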

Regexes can be good for extracting very simple things from HTML, such as "the src paths of all img tags", but once you start talking about nested tags like "find the second <td> tag of each <tr> tag of the table with attribute summary="..."", it becomes much harder to do. This is because regular expressions are not designed to work with nested structures.
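
To make the "simple" case concrete, here is a minimal sketch of the img example. The name img_sources is made up for illustration, and the pattern assumes that src values are quoted and that no attribute value contains a ">" character.

```python
import re

# Matches <img ... src="..."> or <img ... src='...'>, case-insensitively.
# Assumption: attribute values are quoted and contain no ">" characters.
IMG_SRC = re.compile(r'<img\b[^>]*?\bsrc\s*=\s*["\']([^"\']*)["\']', re.IGNORECASE)


def img_sources(html):
    """Return the src value of every <img> tag found in html."""
    return IMG_SRC.findall(html)


print(img_sources('<img src="a.png"> <IMG alt="x" SRC=\'b/c.jpg\'>'))
```

Even this small pattern has blind spots (unquoted values, or a data-src attribute that appears before src), and nothing in it tracks which <tr> or <table> a tag belongs to. That bookkeeping is exactly what a parser does for you.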

See the canonical answer to 'regexps and HTML' questions, and Tom Christiansen's explanation of what it takes to use regexps on arbitrary HTML. tchrist proves that you can use regexps to parse any HTML you want - if you're sufficiently determined - but that a proper parsing library like BeautifulSoup is faster, easier, and will give better results.

Thank you very much for your insight. I generally prefer to use default libraries whenever possible, but it seems the cost of doing this with built-in modules vs third party modules is too high for any possible gains. That said, would it be possible to get an example of using beautiful soup to solve such a dilemma? (I've never used beautiful soup before, and can't seem to figure out how to iterate through elements within the summary) — Cirno, Mar 16 '12 at 04:53

Robert Smith · Answer 2 · 2012-03-16T05:33:35.580

This was supposed to be a comment, but it turned out too long.

BeautifulSoup's documentation is pretty good and contains quite a few examples. Just be aware that there are two versions, and not each of them plays nicely with every version of Python, although you probably won't have problems there (see this: "Beautiful Soup 4 works on both Python 2 (2.7+) and Python 3.").

Furthermore, HTML parsers like BeautifulSoup or lxml clean your HTML before processing it (to make it valid and so you can traverse its tree properly), so they may move certain elements regarded as invalid. Usually, you can disable that feature but then it's not certain you're going to get the results you want.

There are other approaches to solving the task you're asking about. However, they're much more involved to implement, so they may not be desirable under the conditions you described. But just to let you know, the whole field of information extraction (IE) deals with those kinds of issues. Here (PDF) is a more or less recent survey about it, focused mainly on IE from HTML (semi-structured, as they call it) web pages.
