I have the following code:

*** REST OF CODE OMITTED ***
try:
    fullURL = blitzurl + movie
    opener.open(blitzurl)
    urllib2.install_opener(opener)
    request = urllib2.Request(fullURL)
    requestData = urllib2.urlopen(request)
    htmlText = BeautifulSoup(requestData.read())
    
    #panel = htmlText.find(match_class(["panelbox"]))
    #table = htmlText.find("table", {"id" : "scheduletbl"})
    print htmlText

blah....

except Exception, e:
    print str(e)
    print "ERROR: ERROR OCCURRED IN MAIN"

I am trying to get the content of a table with id "scheduletbl" (which is inside a div with a class named "panelbox").

the html code looks like this:

*** REST OF CODE OMITTED ***

<div class="panelbox">
<!-- !!!! content here !!!!! -->
<table border="0" cellpadding="2" cellspacing="0" id="scheduletbl" width="95%">
<tr>
<td align="left" colspan="3">
VC = Special Cinema (Velvet Class)<br/>
VS = Special Cinema (Velvet Suite)<br>
DC = Special Cinema (Dining Cinema)<br/>
S = Special Cinema (Satin)<br/>
3D = in RealD 3D<br/>
4DX = 4DX Cinema
</br></td>
</tr>
<tr>
<td class="separator2" colspan="3"><strong>BLITZMEGAPLEX - PARIS VAN JAVA, BANDUNG</strong></td>
</tr>
<tr>
<td colspan="3"><img align="left" height="16" hspace="5" src="../img/ico_rss_schedule_white.gif" width="16"/><strong><a class="navlink" href="../rss/schedule.php">RSS- Paris van Java</a></strong></td>
</tr>
<tr>
<td class="separator"> </td>
<td class="separator" colspan="2">TUESDAY, 24 SEPTEMBER 2013</td>
</tr>
<tr>
<td class="separator"> </td>
<td class="separator" rel="2D" width="20%">
10:30   
</td>
<td class="separator" width="30%">
<a class="navlink" href="https://www.blitzmegaplex.com/olb/seats.php?showdate=2013-09-24&amp;cinema=0100&amp;movie=MOV1954&amp;showtime=10:30&amp;suite=N&amp;movieformat=2D" target="_blank">Buy Tickets</a></td>
</tr></table></div></div>

<tr>
*** and more <tr> tags ***
*** REST OF CODE OMITTED ***

The problem that I am having is that, when I try to extract the content based on the div class, it gets cut off in the middle (I am guessing because of an improper closing tag).

The same thing also happens when I try to extract the table content (using its id). It also gets cut off in the middle because there is a stray closing tag where it is not supposed to be.

What is the best way to solve this? I have no control over the data since it is scraped from another website.

barny
Jeremy

3 Answers


The improper closing tag can create a problem if you are using the parser that is included by default with Python. As the Beautiful Soup documentation says, it is "Not very lenient (before Python 2.7.3 or 3.2.2)".

So, if you are using a version before that, you might install lxml's HTML parser, which is more lenient:

$ pip install lxml

or, if you want the same HTML parsing as done by browsers, you might install the html5lib parser:

$ pip install html5lib

Either of these should parse your HTML better and be more resilient to bad tag closing. Beautiful Soup automatically chooses the best parser you have installed.
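If you cannot install a third-party parser at all, a minimal sketch of the same idea using only the standard library's `html.parser` module (Python 3 shown here; the fragment below is a hypothetical stand-in for the scraped page, not the real data):

```python
from html.parser import HTMLParser

class TableGrabber(HTMLParser):
    """Collect the text inside <table id="scheduletbl">, tolerating
    stray closing tags such as </br> and an extra </div>."""
    def __init__(self):
        super().__init__(convert_charrefs=True)
        self.in_table = False   # are we inside the target table?
        self.depth = 0          # nesting level for any inner tables
        self.text = []

    def handle_starttag(self, tag, attrs):
        if tag == "table" and ("id", "scheduletbl") in attrs:
            self.in_table = True
            self.depth = 0
        elif self.in_table and tag == "table":
            self.depth += 1

    def handle_endtag(self, tag):
        # Stray </br> or </div> tags simply fall through here harmlessly.
        if self.in_table and tag == "table":
            if self.depth == 0:
                self.in_table = False
            else:
                self.depth -= 1

    def handle_data(self, data):
        if self.in_table:
            self.text.append(data)

# Hypothetical fragment mimicking the problematic markup above.
html = ('<div class="panelbox">'
        '<table id="scheduletbl"><tr><td>10:30<br></br>'
        'Buy Tickets</td></tr></table></div></div>')
grabber = TableGrabber()
grabber.feed(html)
print(" ".join(t.strip() for t in grabber.text if t.strip()))
```

This avoids any compilation step on Windows, at the cost of writing the traversal logic yourself.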

Aavaas
  • Thanks for the suggestion. Unfortunately I am on a Windows machine (which means I have to install a gcc equivalent for Windows), darn... Nevertheless, I will try that. Thanks again for the suggestion. – Jeremy Sep 24 '13 at 05:12
  • No you don't, you just install pip and then install those parsers, easy as pie. http://stackoverflow.com/questions/4750806/how-to-install-pip-on-windows – Aavaas Sep 24 '13 at 05:14
  • The easiest way is (credit to Gringo, from the link above) http://stackoverflow.com/a/14407505/2303994 – Aavaas Sep 24 '13 at 05:19
  • I mean I already have pip installed (with several packages), it's just that when I try to install lxml it always throws errors (this is now completely off-topic). – Jeremy Sep 24 '13 at 05:36
re.search(r'id="scheduletbl".+?</table>', page, re.DOTALL) 

re.DOTALL is needed if newlines are involved. This is the ugly, non-Beautiful way to do it.
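For illustration, the regex approach run against a small hypothetical fragment (standing in for the scraped page content):

```python
import re

# Hypothetical page content mimicking the markup in the question.
page = '''<div class="panelbox">
<table border="0" id="scheduletbl" width="95%">
<tr><td>10:30</td></tr>
</table></div></div>'''

# re.DOTALL makes "." match newlines too, so the non-greedy ".+?"
# can span the whole multi-line table body up to the first </table>.
match = re.search(r'id="scheduletbl".+?</table>', page, re.DOTALL)
if match:
    print(match.group(0))
```

Since it ignores the tag structure entirely, a stray closing tag before `</table>` cannot cut the match short, but it is brittle if the attribute order or quoting ever changes.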

Wonton_User

You can try https://scraperwiki.com/ if you wish to check which tool/lib would fit this task best.

There is an option of using html5lib, pyquery, bs4, etc. (simple to test out).

You can try BeautifulSoup:

BeautifulSoup(html).prettify()

where html is your content.

BS should be good at handling bad HTML...
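A minimal sketch of the prettify approach, assuming the bs4 package is installed and using a hypothetical fragment in place of the real page:

```python
from bs4 import BeautifulSoup  # third-party: pip install beautifulsoup4

# Hypothetical fragment mimicking the scraped markup, including the
# stray </br> and extra </div> from the question.
html = ('<div class="panelbox"><table id="scheduletbl">'
        '<tr><td>10:30<br></br></td></tr></table></div></div>')

# prettify() re-serialises the tree the parser built, so mismatched
# and stray closing tags come out repaired in the output.
pretty = BeautifulSoup(html, "html.parser").prettify()
print(pretty)
```

After that, `find("table", {"id": "scheduletbl"})` can be run on the same soup object to pull out just the table.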

brunod