I'm currently on a mission to scrape popular joke websites. One example is jokes.cc.com. If you visit the site and briefly hover your cursor over the 'Get Random Joke' button on the left of the page, you will notice that the link it points to is jokes.cc.com/#.

If you wait a while, it changes to a proper link within the website that displays the actual joke: jokes.cc.com/*legit joke link*.

If you inspect the page's HTML, you will find a link (<a>) with class=random_link whose href attribute stores the link to the random joke the page wants to redirect you to. You can check this after the page has completely loaded. Basically, the '#' is replaced by a legit link.

Here is my code for scraping the HTML, written the same way I have handled static websites until now. I have used the BeautifulSoup library:

from urllib.request import urlopen  # Python 3; in Python 2 this was urllib.urlopen
from bs4 import BeautifulSoup

urlToRead = "http://jokes.cc.com"
handle = urlopen(urlToRead)
htmlGunk = handle.read()
soup = BeautifulSoup(htmlGunk, "html.parser")
# Find the random-joke link and print the target stored in its href attribute
print(soup.findAll('a', {'class': 'random_link'})[0]['href'])

Output: #

This is the expected output as I have come to realize that the page has not completely rendered.

How do I scrape the page after waiting a while, or after the rendering is complete? Will I need to use external libraries like Mechanize? I'm unsure how to do that, so any help or guidance is appreciated.

EDIT: I was finally able to resolve my issue by using PhantomJS along with Selenium in Python. Here is the code which fetches the page after rendering is complete.

from bs4 import BeautifulSoup
from selenium import webdriver


driver = webdriver.PhantomJS() #selenium for PhantomJS
driver.get('http://jokes.cc.com/')
# Parse the HTML source as it stands after the browser has run the page's scripts
soupFromJokesCC = BeautifulSoup(driver.page_source, "html.parser")
# Locate the link inside the div with id="random_joke"
randomJokeLink = soupFromJokesCC.findAll('div', {'id': 'random_joke'})[0].findAll('a')[0]['href']
# Now go to that page and scrape the joke from there
print(randomJokeLink)  # It works :D
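
One caveat: driver.page_source is read right after driver.get() returns, so if the script has not replaced the '#' yet, the placeholder could still come back. Below is a minimal sketch of an explicit wait, not something I have verified against this site. It assumes the link can be found with the CSS selector "#random_joke a" (mirroring the lookup above); the helper names and the 10-second timeout are made up for illustration.

```python
def href_is_resolved(href):
    """Return True once the placeholder '#' has been replaced by a real link."""
    # get_attribute('href') may return an absolute URL such as
    # 'http://jokes.cc.com/#', so test the ending rather than equality with '#'.
    return bool(href) and not href.strip().endswith("#")


def wait_for_random_link(driver, timeout=10):
    """Block until the random-joke link is filled in, then return its href."""
    # Imported here so href_is_resolved can be used without Selenium installed.
    from selenium.webdriver.common.by import By
    from selenium.webdriver.support.ui import WebDriverWait

    def link_ready(drv):
        # Selector is an assumption based on the div id used above.
        link = drv.find_element(By.CSS_SELECTOR, "#random_joke a")
        href = link.get_attribute("href")
        return href if href_is_resolved(href) else False

    # until() keeps polling link_ready and raises TimeoutException
    # if the link is still a placeholder when the timeout expires.
    return WebDriverWait(driver, timeout).until(link_ready)
```

With this, calling wait_for_random_link(driver) after driver.get('http://jokes.cc.com/') would return the link only once it no longer ends in '#'.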
bholagabbar

1 Answer

The data you're after is generated by JavaScript that runs when the page loads. BeautifulSoup does not have a JavaScript engine, so no matter how long you wait, the link will never change. There are Python libraries that can scrape pages and execute their JavaScript, but your best bet is probably to dig in and work out how the JS on the website actually works. If it has a data feed from which a random joke is pulled, for example, that feed might be in a format such as JSON, which Python can parse very easily. This would make your application much more lightweight than including a full-blown scripting engine.
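
As a rough illustration of the JSON route (I haven't inspected the site's scripts, so the feed layout, the "jokes" key, and the "url" field below are all made up), parsing such a feed could look something like this. The sample string stands in for the body of an HTTP response.

```python
import json
import random


def pick_random_joke(feed_text):
    """Parse a JSON feed and return the URL of one joke chosen at random."""
    feed = json.loads(feed_text)
    jokes = feed["jokes"]  # hypothetical key, not taken from the real site
    return random.choice(jokes)["url"]  # hypothetical field name


# Stand-in for the response body of a (hypothetical) joke feed.
sample_feed = '{"jokes": [{"title": "Example", "url": "/example-joke"}]}'
print(pick_random_joke(sample_feed))
```

If the real site exposes anything like this, only the key names and the URL you fetch would need to change; no browser or JavaScript engine is involved.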

James Scholes
  • Would *selenium* browser automation be the way to go? – bholagabbar Mar 28 '16 at 14:49
  • N.B. I've never used Selenium, but that depends on the scope of your project. If you're writing an application to display jokes, then automating a web browser probably isn't ideal. It would require your users to have the browser installed and open, and you'd end up offloading a lot of the work to that browser. If you dig down into how the JavaScript works, however, you can recreate the behaviour inside your app and scrape jokes without needing to even think about JavaScript. – James Scholes Mar 28 '16 at 14:56
  • How about using a headless browser of sorts? – bholagabbar Mar 28 '16 at 16:00
  • I managed to resolve it using PhantomJS along with Selenium. Have a look at the updated description. – bholagabbar Mar 29 '16 at 16:03