I have the following code:

*** REST OF CODE OMITTED ***
try:
    fullURL = blitzurl + movie
    opener.open(blitzurl)
    urllib2.install_opener(opener)
    request = urllib2.Request(fullURL)
    requestData = urllib2.urlopen(request)
    htmlText = BeautifulSoup(requestData.read())
    
    #panel = htmlText.find(match_class(["panelbox"]))
    #table = htmlText.find("table", {"id" : "scheduletbl"})
    print htmlText

blah....

except Exception, e:
    print str(e)
    print "ERROR: ERROR OCCURRED IN MAIN"

I am trying to get the content of a table with id "scheduletbl" (which is inside a div with a class named "panelbox").

the html code looks like this:

*** REST OF CODE OMITTED ***

<div class="panelbox">
<!-- !!!! content here !!!!! -->
<table border="0" cellpadding="2" cellspacing="0" id="scheduletbl" width="95%">
<tr>
<td align="left" colspan="3">
VC = Special Cinema (Velvet Class)<br/>
VS = Special Cinema (Velvet Suite)<br>
DC = Special Cinema (Dining Cinema)<br/>
S = Special Cinema (Satin)<br/>
3D = in RealD 3D<br/>
4DX = 4DX Cinema
</br></td>
</tr>
<tr>
<td class="separator2" colspan="3"><strong>BLITZMEGAPLEX - PARIS VAN JAVA, BANDUNG</strong></td>
</tr>
<tr>
<td colspan="3"><img align="left" height="16" hspace="5" src="../img/ico_rss_schedule_white.gif" width="16"/><strong><a class="navlink" href="../rss/schedule.php">RSS- Paris van Java</a></strong></td>
</tr>
<tr>
<td class="separator"> </td>
<td class="separator" colspan="2">TUESDAY, 24 SEPTEMBER 2013</td>
</tr>
<tr>
<td class="separator"> </td>
<td class="separator" rel="2D" width="20%">
10:30   
</td>
<td class="separator" width="30%">
<a class="navlink" href="https://www.blitzmegaplex.com/olb/seats.php?showdate=2013-09-24&amp;cinema=0100&amp;movie=MOV1954&amp;showtime=10:30&amp;suite=N&amp;movieformat=2D" target="_blank">Buy Tickets</a></td>
</tr></table></div></div>

<tr>
*** and more <tr> tags ***
*** REST OF CODE OMITTED ***

The problem that I am having is that, when I try to extract the content based on the div class, it gets cut off in the middle (I am guessing because of an improper closing tag).

The same thing also happens when I try to extract the table content (using its id). It also gets cut off in the middle because there is a stray closing tag where it is not supposed to be.

What is the best way to solve this? I have no control over the data since it is scraped from another website.

barny
Jeremy

3 Answers


The improper closing tag can create a problem if you are using the parser that is included by default with Python. As the Beautiful Soup documentation says, it is "Not very lenient (before Python 2.7.3 or 3.2.2)".

So, if you are using a version before that, you might install lxml's HTML parser, which is more lenient:

$ pip install lxml

or, if you want the same HTML parsing as done by browsers, you might install the html5lib parser:

$ pip install html5lib

Either of these should parse your HTML better and be more resilient to bad tag closing. Beautiful Soup automatically chooses the best parser you have installed.
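If you cannot install a third-party parser at all, a minimal sketch of the same idea using only the standard library's `html.parser` module (Python 3 shown here; the fragment below is a hypothetical stand-in for the scraped page, not the real data):

```python
from html.parser import HTMLParser

class TableGrabber(HTMLParser):
    """Collect the text inside <table id="scheduletbl">, tolerating
    stray closing tags such as </br> and an extra </div>."""
    def __init__(self):
        super().__init__(convert_charrefs=True)
        self.in_table = False   # are we inside the target table?
        self.depth = 0          # nesting level for any inner tables
        self.text = []

    def handle_starttag(self, tag, attrs):
        if tag == "table" and ("id", "scheduletbl") in attrs:
            self.in_table = True
            self.depth = 0
        elif self.in_table and tag == "table":
            self.depth += 1

    def handle_endtag(self, tag):
        # Stray </br> or </div> tags simply fall through here harmlessly.
        if self.in_table and tag == "table":
            if self.depth == 0:
                self.in_table = False
            else:
                self.depth -= 1

    def handle_data(self, data):
        if self.in_table:
            self.text.append(data)

# Hypothetical fragment mimicking the problematic markup above.
html = ('<div class="panelbox">'
        '<table id="scheduletbl"><tr><td>10:30<br></br>'
        'Buy Tickets</td></tr></table></div></div>')
grabber = TableGrabber()
grabber.feed(html)
print(" ".join(t.strip() for t in grabber.text if t.strip()))
```

This avoids any compilation step on Windows, at the cost of writing the traversal logic yourself.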

Aavaas
  • Thanks for the suggestion. Unfortunately I am on a Windows machine (which means I have to install a gcc equivalent for Windows), darn... Nevertheless, I will try that. Thanks again for the suggestion. – Jeremy Sep 24 '13 at 05:12
  • No you don't, you just install pip and then install those parsers, easy as pie. http://stackoverflow.com/questions/4750806/how-to-install-pip-on-windows – Aavaas Sep 24 '13 at 05:14
  • The easiest way is (credit to Gringo, from the link above) http://stackoverflow.com/a/14407505/2303994 – Aavaas Sep 24 '13 at 05:19
  • I mean I already have pip installed (with several packages), it's just that when I try to install lxml it always throws errors (this is now completely off-topic). – Jeremy Sep 24 '13 at 05:36
re.search(r'id="scheduletbl".+?</table>', page, re.DOTALL) 

re.DOTALL is needed if newlines are involved. This is the ugly, non-Beautiful way to do it.
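For illustration, the regex approach run against a small hypothetical fragment (standing in for the scraped page content):

```python
import re

# Hypothetical page content mimicking the markup in the question.
page = '''<div class="panelbox">
<table border="0" id="scheduletbl" width="95%">
<tr><td>10:30</td></tr>
</table></div></div>'''

# re.DOTALL makes "." match newlines too, so the non-greedy ".+?"
# can span the whole multi-line table body up to the first </table>.
match = re.search(r'id="scheduletbl".+?</table>', page, re.DOTALL)
if match:
    print(match.group(0))
```

Since it ignores the tag structure entirely, a stray closing tag before `</table>` cannot cut the match short, but it is brittle if the attribute order or quoting ever changes.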

Wonton_User

You can try https://scraperwiki.com/ if you wish to check which tool/lib would fit this task best.

There is an option of using html5lib, pyquery, bs4, etc. (simple to test out).

You can try BeautifulSoup:

BeautifulSoup(html).prettify()

where html is your content.

BS should be good at handling bad HTML...
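A minimal sketch of the prettify approach, assuming the bs4 package is installed and using a hypothetical fragment in place of the real page:

```python
from bs4 import BeautifulSoup  # third-party: pip install beautifulsoup4

# Hypothetical fragment mimicking the scraped markup, including the
# stray </br> and extra </div> from the question.
html = ('<div class="panelbox"><table id="scheduletbl">'
        '<tr><td>10:30<br></br></td></tr></table></div></div>')

# prettify() re-serialises the tree the parser built, so mismatched
# and stray closing tags come out repaired in the output.
pretty = BeautifulSoup(html, "html.parser").prettify()
print(pretty)
```

After that, `find("table", {"id": "scheduletbl"})` can be run on the same soup object to pull out just the table.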

brunod