
There's a website I used to scrape with a Python script (urllib). It seems the website is now blocking my requests: whenever I request a page from the script, I get back HTML containing some JS but none of the usual data. Accessing the website from my browser works just fine. I tried changing the 'User-Agent' header to match the one my browser sends, but it didn't help. One strange behavior I observed: after accessing a page from my browser, I can access it from the script too.
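Here's a minimal sketch of what the script does, assuming Python 3's urllib.request (the URL and User-Agent string are placeholders, not the real site):

```python
import urllib.request

url = "https://example.com/page"  # placeholder for the real site
req = urllib.request.Request(
    url,
    headers={
        # Copied from my browser's request headers (illustrative value)
        "User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) "
                      "AppleWebKit/537.36 (KHTML, like Gecko) "
                      "Chrome/120.0 Safari/537.36",
    },
)
with urllib.request.urlopen(req) as resp:
    html = resp.read().decode("utf-8", errors="replace")
# html contains the JS-only page instead of the usual data
```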

So my questions are:

  1. How can the server detect that the request isn't coming from a browser (even after I change the User-Agent)?
  2. What kind of mechanism could cause the strange behavior of allowing access only after the page has been loaded in a browser? Is it caching? If so, where does the caching happen?
  3. Any ideas how to proceed? (I have an inelegant workaround of making my browser open each page before the script loads it, but it takes too much time.)

Thanks!

EliS

1 Answer


Without many details to go on, it sounds like the site has been updated to include a JavaScript loader. urllib can't execute JavaScript, so it can't get past it. (Pure speculation here.)

There are various ways a site can try to prevent a scraper from accessing it, including having some JavaScript set or update a cookie, or modify the session in some way, so that later requests pass this first test. It's completely site-dependent, so you'll have to investigate it by hand.
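If it does turn out to be a cookie set by that first JavaScript pass, you may be able to get by with replaying the browser's cookie in your script. A rough sketch, assuming a cookie copied out of your browser's dev tools (the cookie name `challenge_token` and its value are made up for illustration):

```python
import urllib.request

url = "https://example.com/page"  # placeholder for the real site
req = urllib.request.Request(
    url,
    headers={
        "User-Agent": "Mozilla/5.0 ...",  # same UA string the browser sent
        # Cookie copied from the browser after it loaded the page;
        # 'challenge_token' is a hypothetical name for illustration
        "Cookie": "challenge_token=abc123",
    },
)
with urllib.request.urlopen(req) as resp:
    html = resp.read().decode("utf-8", errors="replace")
```

If that's the mechanism, the behavior in your second question isn't caching at all: the server is simply honoring the cookie/session your browser established, until it expires.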

The usual solution is to use a JavaScript-aware scraper like Selenium, which drives a locally installed Firefox, Chrome, or IE browser to open the page and can simulate clicking on items. You can also use PhantomJS to process the downloaded page.
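A minimal Selenium sketch (assuming the `selenium` package and a matching browser driver are installed; the URL is a placeholder):

```python
from selenium import webdriver

driver = webdriver.Firefox()  # or webdriver.Chrome()
try:
    driver.get("https://example.com/page")  # placeholder URL
    # The browser executes the page's JavaScript, so page_source
    # holds the fully rendered HTML rather than the JS loader
    html = driver.page_source
finally:
    driver.quit()
```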

There are plenty of posts on SO about this, but here's one that may give you a starting point: Web-scraping JavaScript page with Python

VooDooNOFX