
I want to get the rainfall data of each day from here.

When I am in inspect mode, I can see the data. However, when I view the source code, I cannot find it.

I am using urllib2 and BeautifulSoup from bs4.

Here is my code:

import urllib2
from bs4 import BeautifulSoup

link = "http://www.hko.gov.hk/cis/dailyExtract_e.htm?y=2015&m=1"

r = urllib2.urlopen(link)
soup = BeautifulSoup(r, "html.parser")  # specify a parser to avoid the bs4 warning
print soup.find_all("td", class_="td1_normal_class")
# I also tried this one
# print soup.find_all("div", class_="dataTable")

Both return an empty list.

My question is: how can I get the rendered page content, rather than just the raw page source?

VICTOR

2 Answers


If you open the dev tools in Chrome or Firefox and watch the network requests, you'll see that the data comes from a request to http://www.hko.gov.hk/cis/dailyExtract/dailyExtract_2015.xml, which returns the data for all 12 months; you can then extract what you need from that single response.
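A minimal sketch of pulling the daily rainfall out of that XML with the standard library. The element names used here (`data`, `month`, `dayData`, `day`, `rainfall`) are assumptions for illustration; inspect the real response from dailyExtract_2015.xml and adjust the tag names accordingly. A `urllib2.urlopen` call on the XML URL would replace the inline sample:

```python
import xml.etree.ElementTree as ET

# Made-up sample mimicking one possible shape of dailyExtract_2015.xml;
# the real tag names must be checked against the actual response.
sample = """<stn>
  <data>
    <month>201501</month>
    <dayData>
      <day>1</day>
      <rainfall>0.0</rainfall>
    </dayData>
    <dayData>
      <day>2</day>
      <rainfall>1.5</rainfall>
    </dayData>
  </data>
</stn>"""

root = ET.fromstring(sample)

# Collect rainfall keyed by (month, day)
rainfall = {}
for month in root.findall("data"):
    ym = month.findtext("month")
    for day in month.findall("dayData"):
        rainfall[(ym, day.findtext("day"))] = float(day.findtext("rainfall"))

print(rainfall)
```

This avoids launching a browser entirely, since the XML endpoint serves the data directly.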

Asish M.
  • That's cool! I think your method is more efficient. However, @Simone Zandara's answer sticks closer to the question, so I chose that as the accepted answer. – VICTOR Sep 18 '16 at 07:13

If you cannot find the div in the page source, it means the div you are looking for is generated dynamically, whether by a JS framework like Angular or by plain jQuery. To browse the rendered HTML, you have to use a browser that actually runs the included JS code.

Try using Selenium:

How can I parse a website using Selenium and Beautifulsoup in python?

from bs4 import BeautifulSoup
from selenium import webdriver

# launch a real browser so the page's JavaScript runs
driver = webdriver.Firefox()
driver.get('http://www.hko.gov.hk/cis/dailyExtract_e.htm?y=2015&m=1')

# grab the rendered DOM, not the raw source
html = driver.page_source
soup = BeautifulSoup(html, "html.parser")

print soup.find_all("td", class_="td1_normal_class")

Note, however, that using Selenium slows the process down considerably, since it has to launch a full browser.

Simone Zandara