4

I have some random HTML and I used BeautifulSoup to parse it, but in most cases (>70%) it chokes. I tried using BeautifulSoup 3.0.8 and 3.2.0 (there were some problems with 3.1.0 upwards), but the results are almost the same.

I can recall several HTML parser options available in Python off the top of my head:

  • BeautifulSoup
  • lxml
  • pyquery

I intend to test all of these, but I wanted to know which one, in your tests, comes out as the most forgiving and can even try to parse bad HTML.

Paul D. Waite
Vaibhav Mishra
    Since this isn't really an answer I'm not posting it as such, but what you're describing is precisely the reason Beautiful Soup was developed: to parse bad HTML. If you have a document that's so horribly malformed even Beautiful Soup can't parse it, you may be out of luck. Other parsers that I've heard of (including lxml) are far more strict. – David Z Jul 29 '11 at 08:26
  • See also http://stackoverflow.com/questions/1922032/parsing-html-in-python-lxml-or-beautifulsoup-which-of-these-is-better-for-what – Paul D. Waite Jul 29 '11 at 08:28
    In order to keep this objective, it would be useful to post the minimal snippet for which each parser barfs. – smci Jul 29 '11 at 08:39
  • I haven't tried any other parser, only BeautifulSoup, and it did work 30% of the time in my case, which, given the amount of malformed HTML around, is still impressive. – Vaibhav Mishra Jul 29 '11 at 11:05
  • @Paul, I already looked at that; that's why I mentioned that I used both 3.0.8 and 3.2.0. I will try the 4.0 branch to look at other improvements and post my results here – Vaibhav Mishra Jul 29 '11 at 11:55
  • @Vaibhav: sure, I suspected as much, just thought I’d link to it. I had heard it was abandoned, glad someone’s working on a 4.0. – Paul D. Waite Jul 29 '11 at 15:00
  • pyquery is just a wrapper around lxml – cerberos Jul 30 '11 at 07:36

4 Answers

3

They all are. I have yet to come across an HTML page found in the wild that lxml.html couldn't parse. If lxml barfs on the pages you're trying to parse, you can always preprocess them using some regexps to keep lxml happy.
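
For illustration, a minimal sketch of how forgiving lxml.html is (the sample markup here is invented):

    import lxml.html

    # Deliberately malformed: unclosed tags and mis-nested inline elements.
    broken = "<html><body><p>first<p>second <b>bold<i>both</b> <div>unclosed"

    # lxml.html silently repairs the tree rather than raising an error.
    doc = lxml.html.fromstring(broken)
    print(doc.text_content())       # the recovered text of the document
    print(lxml.html.tostring(doc))  # the repaired, well-formed markup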

lxml itself is fairly strict, but lxml.html is a different parser and can deal with very broken HTML. For extremely broken HTML, lxml also ships with lxml.html.soupparser, which interfaces with the BeautifulSoup library.

Some approaches to parsing broken HTML using lxml.html are described here: http://lxml.de/elementsoup.html
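
If plain lxml.html still chokes, the soupparser fallback can be wired up roughly like this (it needs the BeautifulSoup package installed; the fallback-on-exception pattern is my own sketch, not something from the lxml docs):

    from lxml.html import fromstring
    from lxml.html.soupparser import fromstring as soup_fromstring

    def parse_resilient(markup):
        # Try lxml's fast C parser first; fall back to the slower
        # BeautifulSoup-based parser for pathological input.
        try:
            return fromstring(markup)
        except Exception:
            return soup_fromstring(markup)

    root = parse_resilient("<p>badly <b>nested</i> markup")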

Björn Lindqvist
2

With pages that don't work with anything else (those that contain nested <form> elements come to mind), I've had success with MinimalSoup and ICantBelieveItsBeautifulSoup. Each can handle certain types of error that the other one can't, so often you'll need to try both.
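
Since neither class raises on bad markup (they just build different trees), trying both amounts to parsing with each and keeping whichever tree found what you need. A sketch, assuming Python 2 with the BeautifulSoup 3.x package, where both classes live in the top-level BeautifulSoup module:

    from BeautifulSoup import MinimalSoup, ICantBelieveItsBeautifulSoup

    # Invented example: a form nested inside another form.
    markup = "<form><table><form><input name='q'></form></table></form>"

    # Each class applies different recovery heuristics, so keep the
    # first tree that actually contains the element we're after.
    for cls in (MinimalSoup, ICantBelieveItsBeautifulSoup):
        soup = cls(markup)
        if soup.find("input", {"name": "q"}) is not None:
            break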

cerberos
2

I ended up using BeautifulSoup 4.0 with html5lib for parsing, and it is much more forgiving. With some modifications to my code, it's now working considerably well. Thanks all for the suggestions.
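
For reference, the combination described above looks like this (BeautifulSoup 4 lives in the bs4 module, and html5lib must be installed separately):

    from bs4 import BeautifulSoup

    broken = "<table><tr><td>cell<td>another</table><p>trailing"

    # Naming "html5lib" as the parser makes bs4 build the tree with
    # html5lib's browser-grade error recovery.
    soup = BeautifulSoup(broken, "html5lib")
    print(soup.prettify())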

Vaibhav Mishra
1

If BeautifulSoup doesn't fix your HTML problem, the next best solution would be regular expressions. lxml, ElementTree, and minidom are very strict in parsing, and they are actually right to be.

Other tips:

  1. I feed the HTML to the lynx browser from the command line, take the text version of the page/content, and parse it using regexes (see the sketch after this list).

  2. Converting HTML to text or HTML to Markdown strips all the HTML tags, leaving you with plain text, which is easy to parse.
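
A rough sketch of tip 1 driven from Python via subprocess (the file name and the price regex are invented for illustration; lynx's -dump flag prints the rendered page as plain text, and -nolist suppresses the trailing list of link URLs):

    import re
    import subprocess

    # Let lynx render the HTML and dump the plain-text result.
    text = subprocess.check_output(
        ["lynx", "-dump", "-nolist", "page.html"]
    ).decode("utf-8", errors="replace")

    # With the tags gone, a simple regex is often enough.
    prices = re.findall(r"\$\d+(?:\.\d{2})?", text)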