Are there websites that can tell a script is accessing them, in spite of a changed User-Agent header? I assume the header is changed like this, and it still gives an error.

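import urllib2
url = 'http://example.com/'  # placeholder URL, not from the original question
req_headers = {'User-Agent': 'Mozilla/5.0'}
req = urllib2.Request(url, headers=req_headers)
html = urllib2.urlopen(req).read()  # send the request and read the response body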

If so, how?

Manoj

2 Answers

First of all, your User-Agent string is quite incomplete and easy to detect as fake. A real browser sends a much longer string that includes platform, engine and version tokens.

I describe some robot detection techniques in my answer to "Hunting cheaters in a voting competition".

Otto Allmendinger

Yes. For starters, look at the complete set of request headers your browser sends, using a tool like Firebug. You'll notice that normal browsers provide a lot of information, such as the accepted languages, that urllib does not send by default. So a website might check for the presence of these other headers.
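To make the header check concrete, here is a rough sketch of what a site could do on its side. The function names and the list of expected headers are my own assumptions for illustration, not taken from any particular website or framework, and real detection logic is likely more involved.

```python
# Hypothetical server-side check. The header list below is an illustrative
# assumption about what a typical browser sends, not a definitive rule set.
EXPECTED_BROWSER_HEADERS = ("user-agent", "accept", "accept-language", "accept-encoding")


def missing_browser_headers(request_headers):
    """Return the expected header names that are absent (case-insensitive)."""
    present = {name.lower() for name in request_headers}
    return [name for name in EXPECTED_BROWSER_HEADERS if name not in present]


def looks_like_script(request_headers):
    """Flag a request that lacks any of the headers a browser usually sends."""
    return bool(missing_browser_headers(request_headers))


# A request that only overrides User-Agent, as in the question.
script_request = {"User-Agent": "Mozilla/5.0"}

# A request carrying the kind of headers a desktop browser typically adds.
browser_request = {
    "User-Agent": "Mozilla/5.0 (X11; Linux x86_64; rv:109.0) Gecko/20100101 Firefox/115.0",
    "Accept": "text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8",
    "Accept-Language": "en-US,en;q=0.5",
    "Accept-Encoding": "gzip, deflate",
}

print(looks_like_script(script_request))
print(looks_like_script(browser_request))
```

The flip side is that a script can pass this particular check by sending the same headers a browser would, which is why sites tend to combine several signals.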

Another trick would be to include a 1x1 pixel image on a page and check whether the client requested the image file. If not, then the client is either a text-only browser (like lynx) or a script. I think JavaScript can also be used to check for mouse activity.
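The pixel trick boils down to simple bookkeeping on the server. Below is a minimal sketch, assuming the server can tie requests to a client identifier (an IP address or session cookie, say). `PixelTracker`, `PIXEL_PATH` and the client names are hypothetical and only there to show the idea.

```python
# Hypothetical path of the 1x1 image embedded in every page.
PIXEL_PATH = "/pixel.gif"


class PixelTracker:
    """Track which clients loaded a page and which also fetched the 1x1 image."""

    def __init__(self):
        self.page_views = set()
        self.pixel_hits = set()

    def record_request(self, client_id, path):
        """Register one incoming request for the given client."""
        if path == PIXEL_PATH:
            self.pixel_hits.add(client_id)
        else:
            self.page_views.add(client_id)

    def suspected_scripts(self):
        """Clients that viewed a page but never requested the pixel."""
        return sorted(self.page_views - self.pixel_hits)


tracker = PixelTracker()
# A browser fetches the page and then the image embedded in it.
tracker.record_request("browser-client", "/index.html")
tracker.record_request("browser-client", PIXEL_PATH)
# A urllib-style script fetches only the HTML.
tracker.record_request("script-client", "/index.html")

print(tracker.suspected_scripts())
```

As noted above, a text-only browser would be flagged the same way, so this is a hint rather than proof.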

Generally, it's a game of cat and mouse. One alternative to urllib is Selenium, which launches a real browser window, so the site sees requests coming from an actual browser.
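For reference, a minimal Selenium sketch might look like the following. It assumes the `selenium` package plus Firefox and its driver are installed. The helper name `fetch_with_browser` is mine, and this is only one way to use it.

```python
def fetch_with_browser(url):
    """Load url in a real Firefox window via Selenium and return the page HTML."""
    # Imported inside the function so this sketch can be defined even where
    # Selenium is not installed; calling it does require Selenium and Firefox.
    from selenium import webdriver

    driver = webdriver.Firefox()  # opens a browser window
    try:
        driver.get(url)  # the browser sends its own full set of headers
        return driver.page_source
    finally:
        driver.quit()  # always close the browser, even if loading fails
```

You would call it as `html = fetch_with_browser(url)` with the same `url` you were passing to urllib2. Because a real browser does the fetching, images and JavaScript get loaded as well, at the cost of being much slower than a plain HTTP request.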

ChrisP