I'm currently on a mission to scrape popular joke websites. One example is jokes.cc.com. If you visit the site and briefly hover your cursor over the 'Get Random Joke' button on the left of the page, you will notice that the link it points to is jokes.cc.com/#.
If you wait a while, it changes to a proper link within the site that displays the actual joke: jokes.cc.com/*legit joke link*.
If you analyze the HTML of the page, you will notice that there is a link (<a>) with class=random_link whose href attribute stores the link to the random joke the page wants to redirect you to. You can check this after the page has completely loaded. Basically, the '#' is replaced by a legit link.
Now, here is my code for scraping the HTML, the same way I have done with static websites until now. I have used the BeautifulSoup library:
import urllib
from bs4 import BeautifulSoup
urlToRead = "http://jokes.cc.com"
handle = urllib.urlopen(urlToRead)
htmlGunk = handle.read()
soup = BeautifulSoup(htmlGunk, "html.parser")
# Find out the exact position of the joke in the page
print soup.findAll('a', {'class':'random_link'})[0]['href']
Output: #
This is the expected output, as I have come to realize that the page has not completely rendered at the point where I read it.
How do I scrape the page after waiting a while, or after the rendering is complete? Will I need to use external libraries like Mechanize? I'm unsure how to do that, so any help/guidance is appreciated.
EDIT: I was finally able to resolve my issue by using PhantomJS along with Selenium in Python. Here is the code which fetches the page after rendering is complete.
from bs4 import BeautifulSoup
from selenium import webdriver
driver = webdriver.PhantomJS() #selenium for PhantomJS
driver.get('http://jokes.cc.com/')
soupFromJokesCC = BeautifulSoup(driver.page_source, "html.parser") #fetch HTML source code after rendering
# locate the link in HTML
randomJokeLink = soupFromJokesCC.findAll('div', {'id':'random_joke'})[0].findAll('a')[0]['href']
# now go to that page and scrape the joke from there
print randomJokeLink #It works :D
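One caveat: the code above reads page_source right after driver.get() returns, so the link may still be the '#' placeholder if the page's script has not run yet. Below is a minimal sketch of an explicit wait that polls until the href is no longer the placeholder. It assumes Python 3, and wait_until and rendered_href are hypothetical helper names of my own, not part of Selenium or BeautifulSoup. The get_href callable is a stand-in for whatever reads the current href, for example a Selenium element lookup followed by get_attribute('href').

```python
import time


def wait_until(predicate, timeout=10.0, poll_interval=0.5):
    """Call predicate() repeatedly until it returns a truthy value.

    Returns that value, or raises TimeoutError once `timeout` seconds
    have passed without a truthy result.
    """
    deadline = time.monotonic() + timeout
    while True:
        result = predicate()
        if result:
            return result
        if time.monotonic() >= deadline:
            raise TimeoutError("condition not met within %.1f s" % timeout)
        time.sleep(poll_interval)


def rendered_href(get_href, placeholder="#"):
    """Build a predicate that yields the href only after it has been filled in.

    `get_href` is any zero-argument callable returning the current href
    (hypothetical: e.g. a lambda wrapping a Selenium element lookup).
    The check uses endswith() because a browser may report the placeholder
    as a resolved URL such as 'http://jokes.cc.com/#'.
    """
    def predicate():
        href = get_href()
        if href and not href.endswith(placeholder):
            return href
        return None
    return predicate
```

With a live driver, the call would look something like wait_until(rendered_href(lambda: link.get_attribute('href'))), where link is the located <a> element. Selenium also ships its own WebDriverWait class in selenium.webdriver.support.ui, which serves the same purpose and is probably the better choice in real code; the helper above only illustrates the idea.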