
I need to scrape a site with Python. I obtain the HTML source with the urllib module, but I also need to scrape some HTML that is generated by a JavaScript function (which is included in the HTML source). What this function does on the site is that when you press a button it outputs some HTML. How can I "press" this button with Python code? Can Scrapy help me? I captured the POST request with Firebug, but when I try to pass it on the URL I get a 403 error. Any suggestions?

hymloth
  • I answered a similar question on [Click on a javascript link within python?](http://stackoverflow.com/questions/5207948/click-on-a-javascript-link-within-python/5227031#5227031) – sw. Mar 13 '11 at 00:42
  • Does this answer your question? [Web-scraping JavaScript page with Python](https://stackoverflow.com/questions/8049520/web-scraping-javascript-page-with-python) – ggorlen Feb 27 '21 at 22:58

5 Answers


In Python, I think Selenium 1.0 is the way to go. It’s a library that allows you to control a real web browser from your language of choice.

You need to have the web browser in question installed on the machine your script runs on, but it looks like the most reliable way to programmatically interrogate websites that use a lot of JavaScript.
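For instance, a minimal sketch of that approach might look like the following. The helper accepts any Selenium WebDriver object; the URL and the button id `show-more` are hypothetical placeholders, and `find_element_by_id` matches the older Selenium API of that era:

```python
# Hedged sketch: click a button with Selenium, then read the HTML the
# page's JavaScript generated. 'show-more' and the URL are made-up names.
def html_after_click(driver, url, button_id):
    """Load url, click the element with the given id, and return the
    page source after the browser has executed the click handler."""
    driver.get(url)
    driver.find_element_by_id(button_id).click()
    return driver.page_source  # the DOM *after* JavaScript has run

# Typical use with a real browser (requires Firefox plus the Selenium
# bindings installed; not run here):
#   from selenium import webdriver
#   driver = webdriver.Firefox()
#   print(html_after_click(driver, 'http://example.com/page', 'show-more'))
#   driver.quit()
```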

Paul D. Waite
  • Is there a way to do it with requests and beautiful soup itself? I have been using requests and it works fine in every other case but this. Please let me know if requests can also solve this thing. – Shaardool Jun 09 '15 at 16:38
  • @Shaardool: solve what thing? Scraping HTML that’s generated in the browser by JavaScript? No — for that you need something that runs the JavaScript so that it can produce the HTML. Beautiful Soup doesn’t run JavaScript. – Paul D. Waite Jun 09 '15 at 16:48
  • thanks for the insight, can Requests library do it? It works well with AJAX requests to server, but I want to know if it can work with javascript that creates HTML too. I didn't find any such thing in their documentation, though. – Shaardool Jun 09 '15 at 16:56
  • @Shaardool I’m not familiar with Requests library. You’ll likely get an answer quicker by asking a new question specifically about that library. – Paul D. Waite Jun 09 '15 at 17:36

Since there is no comprehensive answer here, I'll go ahead and write one.

To scrape JS-rendered pages, we need a browser with a JavaScript engine (i.e., one that supports JavaScript rendering).

Options like Mechanize and urllib2 will not work, since they do not support JavaScript.

So here's what you do:

Set up PhantomJS to run with Selenium. After installing the dependencies for both of them, you can use the following code as an example to fetch a fully rendered page.

from selenium import webdriver
from bs4 import BeautifulSoup

driver = webdriver.PhantomJS()
driver.get('http://jokes.cc.com/')
# page_source fetches the page after JavaScript rendering is complete
soupFromJokesCC = BeautifulSoup(driver.page_source, 'html.parser')
driver.save_screenshot('screen.png')  # save a screenshot to disk

driver.quit()
bholagabbar

I have had to do this before (in .NET) and you are basically going to have to host a browser, get it to click the button, and then interrogate the DOM (document object model) of the browser to get at the generated HTML.

This is definitely one of the downsides to web apps moving towards an Ajax/Javascript approach to generating HTML client-side.
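To illustrate the "interrogate the DOM" step in Python terms: once the hosted browser has executed the JavaScript, the markup it hands back is ordinary HTML and can be parsed offline like any other string. The `class="generated"` marker and the sample snippet below are invented for the illustration; a standard-library parser is used so the sketch stands alone:

```python
# Made-up stand-in for the HTML a hosted browser would return after the
# page's JavaScript has run; in practice this string would come from the
# browser (e.g. Selenium's driver.page_source).
from html.parser import HTMLParser

class GeneratedTextExtractor(HTMLParser):
    """Collect the text inside elements carrying class="generated"."""
    def __init__(self):
        super().__init__()
        self._depth = 0   # >0 while inside a matching element
        self.texts = []

    def handle_starttag(self, tag, attrs):
        if self._depth or ('class', 'generated') in attrs:
            self._depth += 1

    def handle_endtag(self, tag):
        if self._depth:
            self._depth -= 1

    def handle_data(self, data):
        if self._depth and data.strip():
            self.texts.append(data.strip())

rendered = '<div id="out"><span class="generated">made by JS</span></div>'
parser = GeneratedTextExtractor()
parser.feed(rendered)
print(parser.texts)  # ['made by JS']
```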

Bryan Batchelder

I use webkit, which is the browser renderer behind Chrome and Safari. There are Python bindings to webkit through Qt. And here is a full example to execute JavaScript and extract the final HTML.

hoju

For Scrapy (a great Python scraping framework) there is scrapyjs: an additional downloader handler / middleware handler able to scrape JavaScript-generated content.

It's based on the WebKit engine via pygtk, python-webkit, and python-jswebkit, and it's quite simple to use.

lgaggini