There's a website I used to scrape with a Python script (urllib). It seems the website is now blocking my requests: whenever I request a page from the script, I get HTML containing some JS but none of the usual data. Accessing the website from my browser works just fine. I tried changing the 'User-Agent' header to match the one my browser sends, but it didn't help. One strange behavior I observed is that after I open a page in my browser, I can then access that page from the script too.
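For reference, here is a minimal sketch of the kind of request I'm making, assuming Python 3's `urllib.request`. The URL, the User-Agent string, and the helper names `build_request` and `fetch` are placeholders for illustration, not the real site or my actual code.

```python
import urllib.request

# Placeholder values (assumptions): not the real site or my real browser string.
URL = "https://example.com/page"
BROWSER_UA = (
    "Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 "
    "(KHTML, like Gecko) Chrome/120.0 Safari/537.36"
)


def build_request(url, user_agent):
    """Build a GET request that overrides urllib's default User-Agent."""
    return urllib.request.Request(url, headers={"User-Agent": user_agent})


def fetch(url, user_agent, timeout=10):
    """Send the request and return the response body as text."""
    req = build_request(url, user_agent)
    with urllib.request.urlopen(req, timeout=timeout) as resp:
        # Fall back to UTF-8 if the server does not declare a charset.
        charset = resp.headers.get_content_charset() or "utf-8"
        return resp.read().decode(charset, errors="replace")


# Usage (performs a real network request):
# html = fetch(URL, BROWSER_UA)
# print(html[:500])
```

Note that this only changes the User-Agent header; everything else about the request is still whatever urllib sends by default.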
So my questions are:
- How can the server detect that the request is not coming from a browser (even after I change the User-Agent)?
- What kind of mechanism could cause the strange behavior of allowing access only after the page has been loaded in a browser? Is it caching? If so, where does the caching happen?
- Any ideas on how to proceed? (I have a not very elegant workaround of making my browser open each page before the script loads it, but it takes too much time.)
Thanks!