Website scraping, robot identification

Question

Are there websites that can identify that a script is accessing them, and respond with an error, even after the User-Agent header has been changed? I assume the header is changed like this:

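import urllib2

url = 'http://example.com/'  # placeholder URL
req_headers = {'User-Agent': 'Mozilla/5.0'}
req = urllib2.Request(url, headers=req_headers)
# Request objects have no open() method; pass the request to urlopen() instead
html = urllib2.urlopen(req).read()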

If yes, then how?

score 0 · Answer 1 · edited May 23 '17 at 11:56

First of all, your User-Agent is quite incomplete and easily detected as fake: real browsers send a much longer string that includes platform and rendering-engine details.

I describe some robot detection techniques in my answer to "Hunting cheaters in a voting competition".

answered Jul 13 '12 at 14:14

Otto Allmendinger

score 0 · Accepted Answer · answered Jul 13 '12 at 14:16

Yes. For starters, look at the complete set of request headers your browser sends, using a tool like Firebug. You'll notice that normal browsers provide a lot of information, such as the accepted languages (the Accept-Language header), that urllib does not send by default. So a website might check for the presence of these other headers.
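To make the header check concrete, here is a minimal sketch in Python 3, where urllib.request is the successor to urllib2. The header values, the EXPECTED list, and the helper names are my own assumptions for illustration, not what any particular site checks. The request is only built, never sent.

```python
import urllib.request

# Hypothetical header set imitating a desktop browser; values are illustrative.
BROWSER_HEADERS = {
    "User-Agent": "Mozilla/5.0 (X11; Linux x86_64; rv:109.0) Gecko/20100101 Firefox/115.0",
    "Accept": "text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8",
    "Accept-Language": "en-US,en;q=0.5",
}

# Headers that this hypothetical server-side check expects from a real browser.
EXPECTED = ("user-agent", "accept", "accept-language")


def missing_browser_headers(headers):
    """Return the expected header names that are absent (case-insensitive)."""
    present = {name.lower() for name in headers}
    return [name for name in EXPECTED if name not in present]


def build_request(url, headers):
    """Build (but do not send) a request carrying the given headers."""
    return urllib.request.Request(url, headers=headers)


bare = build_request("http://example.com/", {"User-Agent": "Mozilla/5.0"})
full = build_request("http://example.com/", BROWSER_HEADERS)

print(missing_browser_headers(dict(bare.header_items())))  # ['accept', 'accept-language']
print(missing_browser_headers(dict(full.header_items())))  # []
```

A request that only sets User-Agent is flagged here because the other two headers are missing. A real site could look at many more signals than this.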

Another trick would be to include a 1x1 pixel image on a page and check whether the client requested the image file. If it did not, the client is either a text-only browser (like lynx) or a script. I think JavaScript can also be used to check for mouse activity, such as mouse movement events, which a plain script does not generate.
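The pixel check could look roughly like this on the server side. This is only a sketch under my own assumptions: the log format (pairs of client id and requested path), the PIXEL_PATH value, and the function name are all hypothetical.

```python
from collections import defaultdict

PIXEL_PATH = "/pixel.gif"  # hypothetical URL of the 1x1 image


def suspected_scripts(access_log):
    """Return the clients in the log that never requested the tracking pixel.

    access_log is an iterable of (client_id, path) pairs.
    """
    paths_by_client = defaultdict(set)
    for client_id, path in access_log:
        paths_by_client[client_id].add(path)
    return sorted(
        client
        for client, paths in paths_by_client.items()
        if PIXEL_PATH not in paths
    )


log = [
    ("10.0.0.1", "/index.html"),
    ("10.0.0.1", "/pixel.gif"),   # a browser fetches the embedded image
    ("10.0.0.2", "/index.html"),  # this client never asked for the image
]
print(suspected_scripts(log))  # ['10.0.0.2']
```

Since text-only browsers also skip images, a check like this only marks a client as suspicious and does not prove it is a script.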

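Generally, it's a game of cat and mouse. One alternative to urllib is Selenium, which launches and drives a real browser window, so the requests carry the browser's own headers and the page's images and JavaScript are loaded as usual.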
