Parsing a website with a javascript call using Python

Question

Since I couldn't find an API function in Wikimedia Commons to get the license of an image, the only thing left to do is to fetch the webpage and parse it myself.

For each image, there is a nice popup in wikimedia that lists the "Attribution" field which I need. For example, in the page http://commons.wikimedia.org/wiki/File:Brad_Pitt_Cannes_2011.jpg there is a link on the right saying "Use this file on the web". When clicking on it I can see the "Attribution" field which I need.

Using Python, how can I fetch the webpage and initiate a javascript call to open that pop up in order to retrieve the text inside the "Attribution" field?

Thanks!

meir

possible duplicate of http://stackoverflow.com/questions/2148493/scrape-html-generated-by-javascript-with-python. — hymloth, Sep 17 '11 at 10:36

score 4 · Answer 1 · answered Sep 17 '11 at 13:44

Using unutbu's answer, I converted it to use Selenium WebDriver (rather than the older Selenium-RC).

import lxml.html as lh
from selenium import webdriver

# Load the page in a real browser so its JavaScript runs, then grab the rendered HTML.
browser = webdriver.Firefox()
browser.get('http://commons.wikimedia.org/wiki/File%3aBrad_Pitt_Cannes_2011.jpg')
content = browser.page_source
browser.quit()

# Print the text of every span that wraps a "Use this file ..." link.
doc = lh.fromstring(content)
for elt in doc.xpath('//span[a[contains(@title,"Use this file")]]/text()'):
    print(elt)

output:

on the web
on a wiki

Thanks, however, I need the text that opens inside the popup when clicking "on the web" — Meir, Sep 17 '11 at 15:36

score 1 · Answer 2 · answered Sep 17 '11 at 10:32

Assuming you can read Javascript, you can look at this Javascript file: http://commons.wikimedia.org/w/index.php?title=MediaWiki:Stockphoto.js&action=raw&ctype=text/javascript

You can see what the Javascript does in order to get its info (look at get_author_attribution and get_license). You can port this to Python, using BeautifulSoup to parse the HTML.
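As a starting point for reading that file, here is a minimal sketch that downloads the raw script and reports the lines on which the two functions named above are mentioned. The helper names fetch_script and find_mentions and the User-Agent value are made up for illustration; only the URL and the two function names come from this answer. It does not port any of the JavaScript logic.

```python
import re
from urllib.request import Request, urlopen

# URL quoted in the answer above.
STOCKPHOTO_JS = ("http://commons.wikimedia.org/w/index.php"
                 "?title=MediaWiki:Stockphoto.js&action=raw&ctype=text/javascript")


def fetch_script(url=STOCKPHOTO_JS):
    """Download the raw JavaScript source and return it as text."""
    # The User-Agent string is a placeholder chosen for this example.
    request = Request(url, headers={"User-Agent": "attribution-example/0.1"})
    with urlopen(request, timeout=30) as response:
        return response.read().decode("utf-8", errors="replace")


def find_mentions(source, names=("get_author_attribution", "get_license")):
    """Return {name: [1-based line numbers]} for every line mentioning a name."""
    mentions = {name: [] for name in names}
    for number, line in enumerate(source.splitlines(), start=1):
        for name in names:
            # \b keeps get_license from matching inside a longer identifier.
            if re.search(r"\b" + re.escape(name) + r"\b", line):
                mentions[name].append(number)
    return mentions
```

Calling find_mentions(fetch_script()) should give the line numbers to jump to when reading the script by hand.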

score 1 · Answer 3 · unutbu · edited Sep 17 '11 at 17:32

I'd be interested to see how this is done using other tools. Using Selenium RC and lxml, it can be done like this:

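import selenium

# Connect to a Selenium RC server on localhost:4444 and start a Firefox session.
sel = selenium.selenium("localhost", 4444, "*firefox", "file://")
sel.start()
sel.open('http://commons.wikimedia.org/wiki/File%3aBrad_Pitt_Cannes_2011.jpg')

# Open the popup, then read the value of the Attribution input field.
sel.click('//a[contains(@title,"Use this file on the web")]')
print(sel.get_value('//input[@id="stockphoto_attribution"]'))
sel.stop()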

yields

Georges Biard [CC-BY-SA-3.0 (www.creativecommons.org/licenses/by-sa/3.0)], via Wikimedia Commons
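The same click-and-read steps can probably be done with the WebDriver API as well. The following is a rough sketch, not a tested recipe: the function name read_attribution and the timeout and poll parameters are made up for this example, the two XPath expressions are copied from the Selenium RC code above, and it assumes the popup still fills an input with id stockphoto_attribution.

```python
import time

# Locators copied from the Selenium RC example above.
LINK_XPATH = '//a[contains(@title,"Use this file on the web")]'
FIELD_XPATH = '//input[@id="stockphoto_attribution"]'


def read_attribution(driver, url, timeout=10.0, poll=0.25):
    """Open url, click the "Use this file on the web" link and
    return the text of the Attribution field.

    driver is any Selenium WebDriver instance, e.g. webdriver.Firefox().
    """
    driver.get(url)
    # "xpath" is the string value of selenium's By.XPATH locator.
    driver.find_element("xpath", LINK_XPATH).click()

    deadline = time.monotonic() + timeout
    while True:
        # find_elements returns an empty list (no exception) when nothing matches,
        # so we can poll until the popup's input has appeared and been filled.
        fields = driver.find_elements("xpath", FIELD_XPATH)
        if fields:
            value = fields[0].get_attribute("value")
            if value:
                return value
        if time.monotonic() >= deadline:
            raise TimeoutError("Attribution field did not appear")
        time.sleep(poll)
```

With Firefox and its driver installed, usage would be along the lines of creating driver = webdriver.Firefox() (after from selenium import webdriver), calling read_attribution(driver, 'http://commons.wikimedia.org/wiki/File%3aBrad_Pitt_Cannes_2011.jpg'), and then driver.quit().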

edited Sep 17 '11 at 17:32

answered Sep 17 '11 at 12:07

unutbu


nice. I added a new answer showing Selenium WebDriver instead of Selenium-RC. the webdriver version doesn't require the Selenium server. – Corey Goldberg Sep 17 '11 at 13:45
