42

I would like to detect (on the server side) which requests are from bots. I don't care about malicious bots at this point, just the ones that are playing nice. I've seen a few approaches that mostly involve matching the user agent string against keywords like 'bot'. But that seems awkward, incomplete, and unmaintainable. So does anyone have any more solid approaches? If not, do you have any resources you use to keep up to date with all the friendly user agents?

If you're curious: I'm not trying to do anything against any search engine policy. We have a section of the site where a user is randomly presented with one of several slightly different versions of a page. However, if a web crawler is detected, we'd always give them the same version so that the index is consistent.

Also I'm using Java, but I would imagine the approach would be similar for any server-side technology.

j0k
JavadocMD

7 Answers

89

You said matching the user agent on ‘bot’ may be awkward, but we’ve found it to be a pretty good match. Our studies have shown that it covers about 98% of the hits you receive, and we haven’t come across any false positives yet. If you want to raise that to 99.9% you can include a few other well-known matches such as ‘crawler’, ‘baiduspider’, ‘ia_archiver’, ‘curl’, etc. We’ve tested this on our production systems over millions of hits.

Here are a few C# solutions for you:

1) Simplest

This is the fastest option when processing a miss, i.e. traffic from a non-bot (a normal user). It catches 99+% of crawlers.

bool iscrawler = Regex.IsMatch(Request.UserAgent, @"bot|crawler|baiduspider|80legs|ia_archiver|voyager|curl|wget|yahoo! slurp|mediapartners-google", RegexOptions.IgnoreCase);

2) Medium

This is the fastest option when processing a hit, i.e. traffic from a bot, and it is pretty fast for misses too. It catches close to 100% of crawlers. It matches ‘bot’, ‘crawler’ and ‘spider’ up front, and you can add any other known crawlers to it.

List<string> Crawlers3 = new List<string>()
{
    "bot","crawler","spider","80legs","baidu","yahoo! slurp","ia_archiver","mediapartners-google",
    "lwp-trivial","nederland.zoek","ahoy","anthill","appie","arale","araneo","ariadne",            
    "atn_worldwide","atomz","bjaaland","ukonline","calif","combine","cosmos","cusco",
    "cyberspyder","digger","grabber","downloadexpress","ecollector","ebiness","esculapio",
    "esther","felix ide","hamahakki","kit-fireball","fouineur","freecrawl","desertrealm",
    "gcreep","golem","griffon","gromit","gulliver","gulper","whowhere","havindex","hotwired",
    "htdig","ingrid","informant","inspectorwww","iron33","teoma","ask jeeves","jeeves",
    "image.kapsi.net","kdd-explorer","label-grabber","larbin","linkidator","linkwalker",
    "lockon","marvin","mattie","mediafox","merzscope","nec-meshexplorer","udmsearch","moget",
    "motor","muncher","muninn","muscatferret","mwdsearch","sharp-info-agent","webmechanic",
    "netscoop","newscan-online","objectssearch","orbsearch","packrat","pageboy","parasite",
    "patric","pegasus","phpdig","piltdownman","pimptrain","plumtreewebaccessor","getterrobo-plus",
    "raven","roadrunner","robbie","robocrawl","robofox","webbandit","scooter","search-au",
    "searchprocess","senrigan","shagseeker","site valet","skymob","slurp","snooper","speedy",
    "curl_image_client","suke","www.sygol.com","tach_bw","templeton","titin","topiclink","udmsearch",
    "urlck","valkyrie libwww-perl","verticrawl","victoria","webscout","voyager","crawlpaper",
    "webcatcher","t-h-u-n-d-e-r-s-t-o-n-e","webmoose","pagesinventory","webquest","webreaper",
    "webwalker","winona","occam","robi","fdse","jobo","rhcs","gazz","dwcp","yeti","fido","wlm",
    "wolp","wwwc","xget","legs","curl","webs","wget","sift","cmc"
};
string ua = Request.UserAgent.ToLower();
bool iscrawler = Crawlers3.Exists(x => ua.Contains(x));

3) Paranoid

This option is pretty fast, but a little slower than options 1 and 2. It is the most accurate, and it allows you to maintain the lists if you want. You can maintain a separate list of names that contain ‘bot’ if you are afraid of false positives in the future. If we get a short match we log it and check it for a false positive.

// crawlers that have 'bot' in their useragent
List<string> Crawlers1 = new List<string>()
{
    "googlebot","bingbot","yandexbot","ahrefsbot","msnbot","linkedinbot","exabot","compspybot",
    "yesupbot","paperlibot","tweetmemebot","semrushbot","gigabot","voilabot","adsbot-google",
    "botlink","alkalinebot","araybot","undrip bot","borg-bot","boxseabot","yodaobot","admedia bot",
    "ezooms.bot","confuzzledbot","coolbot","internet cruiser robot","yolinkbot","diibot","musobot",
    "dragonbot","elfinbot","wikiobot","twitterbot","contextad bot","hambot","iajabot","news bot",
    "irobot","socialradarbot","ko_yappo_robot","skimbot","psbot","rixbot","seznambot","careerbot",
    "simbot","solbot","mail.ru_bot","spiderbot","blekkobot","bitlybot","techbot","void-bot",
    "vwbot_k","diffbot","friendfeedbot","archive.org_bot","woriobot","crystalsemanticsbot","wepbot",
    "spbot","tweetedtimes bot","mj12bot","who.is bot","psbot","robot","jbot","bbot","bot"
};

// crawlers that don't have 'bot' in their useragent
List<string> Crawlers2 = new List<string>()
{
    "baiduspider","80legs","baidu","yahoo! slurp","ia_archiver","mediapartners-google","lwp-trivial",
    "nederland.zoek","ahoy","anthill","appie","arale","araneo","ariadne","atn_worldwide","atomz",
    "bjaaland","ukonline","bspider","calif","christcrawler","combine","cosmos","cusco","cyberspyder",
    "cydralspider","digger","grabber","downloadexpress","ecollector","ebiness","esculapio","esther",
    "fastcrawler","felix ide","hamahakki","kit-fireball","fouineur","freecrawl","desertrealm",
    "gammaspider","gcreep","golem","griffon","gromit","gulliver","gulper","whowhere","portalbspider",
    "havindex","hotwired","htdig","ingrid","informant","infospiders","inspectorwww","iron33",
    "jcrawler","teoma","ask jeeves","jeeves","image.kapsi.net","kdd-explorer","label-grabber",
    "larbin","linkidator","linkwalker","lockon","logo_gif_crawler","marvin","mattie","mediafox",
    "merzscope","nec-meshexplorer","mindcrawler","udmsearch","moget","motor","muncher","muninn",
    "muscatferret","mwdsearch","sharp-info-agent","webmechanic","netscoop","newscan-online",
    "objectssearch","orbsearch","packrat","pageboy","parasite","patric","pegasus","perlcrawler",
    "phpdig","piltdownman","pimptrain","pjspider","plumtreewebaccessor","getterrobo-plus","raven",
    "roadrunner","robbie","robocrawl","robofox","webbandit","scooter","search-au","searchprocess",
    "senrigan","shagseeker","site valet","skymob","slcrawler","slurp","snooper","speedy",
    "spider_monkey","spiderline","curl_image_client","suke","www.sygol.com","tach_bw","templeton",
    "titin","topiclink","udmsearch","urlck","valkyrie libwww-perl","verticrawl","victoria",
    "webscout","voyager","crawlpaper","wapspider","webcatcher","t-h-u-n-d-e-r-s-t-o-n-e",
    "webmoose","pagesinventory","webquest","webreaper","webspider","webwalker","winona","occam",
    "robi","fdse","jobo","rhcs","gazz","dwcp","yeti","crawler","fido","wlm","wolp","wwwc","xget",
    "legs","curl","webs","wget","sift","cmc"
};

string ua = Request.UserAgent.ToLower();
string match = null;

if (ua.Contains("bot")) match = Crawlers1.FirstOrDefault(x => ua.Contains(x));
else match = Crawlers2.FirstOrDefault(x => ua.Contains(x));

if (match != null && match.Length < 5) Log("Possible new crawler found: ", ua);

bool iscrawler = match != null;

Notes:

  • It’s tempting to just keep adding names to the regex in option 1, but the more alternations you add, the slower it gets. If you want a more complete list, LINQ with a lambda is faster.
  • Make sure .ToLower() is outside of your LINQ method – remember the method is effectively a loop, and you would be re-lowercasing the string on every iteration.
  • Always put the heaviest bots at the start of the list, so they match sooner.
  • Put the lists into a static class so that they are not rebuilt on every pageview (see the sketch below).
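
For that last note, a minimal sketch of what such a static holder might look like (the class name is just an illustration):

using System.Collections.Generic;

public static class CrawlerLists
{
    // Built once when the type is first used, not on every pageview.
    public static readonly List<string> Crawlers1 = new List<string>
    {
        "googlebot", "bingbot", /* ...the rest of the 'bot' list above... */ "bot"
    };

    public static readonly List<string> Crawlers2 = new List<string>
    {
        "baiduspider", "80legs", /* ...the rest of the non-'bot' list above... */ "cmc"
    };
}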

Honeypots

The only real alternative to this is to create a ‘honeypot’ link on your site that only a bot will reach. You then log the user agent strings that hit the honeypot page to a database. You can then use those logged strings to classify crawlers.
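
A minimal sketch of the logging half, assuming a hypothetical CrawlerHits table and connection string (substitute whatever storage you actually use):

using System;
using System.Data.SqlClient;
using System.Web;

public static class HoneypotLogger
{
    // Call this from the honeypot page: record who hit it, for later classification.
    public static void LogHit(HttpRequest request)
    {
        using (var conn = new SqlConnection("<your-connection-string>"))
        using (var cmd = new SqlCommand(
            "INSERT INTO CrawlerHits (UserAgent, Ip, HitAt) VALUES (@ua, @ip, @at)", conn))
        {
            cmd.Parameters.AddWithValue("@ua", request.UserAgent ?? "");
            cmd.Parameters.AddWithValue("@ip", request.UserHostAddress);
            cmd.Parameters.AddWithValue("@at", DateTime.UtcNow);
            conn.Open();
            cmd.ExecuteNonQuery();
        }
    }
}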

Positives: It will match some unknown crawlers that aren’t declaring themselves.

Negatives: Not all crawlers dig deep enough to hit every link on your site, and so they may not reach your honeypot.

Dave Sumter
  • Is there a C# NuGet package that has a list of user agent headers for the most popular web spiders? It would be nice to get updates from time to time, because some spiders stop working and some change their headers – Duke Mar 27 '14 at 20:16
  • Hmmm, not that I'm aware of. The other problem is that some spiders are not "registered" anywhere, or don't even set a user agent string. I could create a spider right now and run it from my PC. – Dave Sumter Apr 18 '14 at 08:58
  • Since your list Crawlers1 ends with the entry "bot", your lookup into this list will always succeed when ua.Contains("bot") is true, so you wouldn't even need to check the list in that case. Either change the list to remove "bot", or, if it's a valid entry, skip the Contains code and just assume it's a bot. – Andy Jun 30 '14 at 19:42
  • Hi Andy, you are correct. As per my answer, I left the term 'bot' in there as a catch-all, but some may want to remove it if they don't want false positives. If they do keep it, they don't need to do the sub-lookup like you suggested; I use it to harvest new matches and log them. – Dave Sumter Jul 17 '14 at 17:26
  • `It will match some unknown crawlers that aren’t declaring themselves.` – This can be dangerous if the crawler is using a normal user agent (i.e. as if it were a regular user). – CodeNaked Jul 02 '15 at 15:14
28

You can find a very thorough database of known "good" web crawlers in the robotstxt.org Robots Database. Using this data would be far more effective than just matching 'bot' in the user agent.
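
A rough sketch of how that data might be used, assuming you have exported it to a local file named known_robots.txt with one user-agent substring per line (the file name and format are assumptions, not something the database provides in that exact shape):

using System.IO;
using System.Linq;

public static class RobotsDatabase
{
    // Loaded once per application; each line is a lower-cased user-agent fragment.
    private static readonly string[] KnownRobots =
        File.ReadAllLines("known_robots.txt")
            .Select(line => line.Trim().ToLowerInvariant())
            .Where(line => line.Length > 0)
            .ToArray();

    public static bool IsKnownRobot(string userAgent)
    {
        if (string.IsNullOrEmpty(userAgent)) return false;
        string ua = userAgent.ToLowerInvariant();
        return KnownRobots.Any(robot => ua.Contains(robot));
    }
}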

Sparr
10

One suggestion is to create an empty anchor on your page that only a bot would follow. Normal users would never see the link, so only spiders and bots would follow it. For example, an empty anchor tag that points to a subfolder would record a GET request in your logs...

<a href="dontfollowme.aspx"></a>

Many people use this method while running a honeypot to catch malicious bots that aren't following the robots.txt file. I use the empty anchor method in an ASP.NET honeypot solution I wrote to trap and block those creepy crawlers...
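
For this to single out the misbehaving bots, the honeypot path would typically be disallowed in robots.txt so that well-behaved crawlers never follow it; for example:

User-agent: *
Disallow: /dontfollowme.aspx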

Dscoduc
  • Just out of curiosity, this made me wonder if that might mess with accessibility – like if someone could accidentally select that anchor using the Tab key and then hit Return to click it after all. Well, apparently not (see http://jsbin.com/efipa/ for a quick test), but of course I've only tested with a normal browser. – Arjan Oct 08 '09 at 16:40
  • Need to be a little bit careful with techniques like this that you don't get your site blacklisted for using blackhat SEO techniques. – Ian Mercer Apr 23 '10 at 21:14
  • Also, what if the bot is using a normal user agent like any other visitor would, too? – Martin Braun Dec 22 '16 at 13:40
7

Any visitor whose entry page is /robots.txt is probably a bot.
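
A rough sketch of that check, assuming your ASP.NET pipeline actually sees the /robots.txt request (by default a static file may be served without ever hitting your code), with a hypothetical in-memory store:

using System;
using System.Collections.Concurrent;
using System.Web;

public static class RobotsTxtTracker
{
    // IPs that have requested /robots.txt; a real implementation would expire old entries.
    private static readonly ConcurrentDictionary<string, DateTime> Requesters =
        new ConcurrentDictionary<string, DateTime>();

    // Call from Application_BeginRequest in Global.asax.
    public static void Record(HttpRequest request)
    {
        if (string.Equals(request.Path, "/robots.txt", StringComparison.OrdinalIgnoreCase))
            Requesters[request.UserHostAddress] = DateTime.UtcNow;
    }

    public static bool IsLikelyBot(string clientIp)
    {
        return Requesters.ContainsKey(clientIp);
    }
}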

Sparr
  • Or, to be less strict, a visitor who requests robots.txt at all is probably a bot, although there are a few Firefox plugins that grab it while a human is browsing. – Sparr Feb 17 '09 at 21:04
  • Any bot that goes there is probably a well-behaved, respectable bot, the kind you might want visiting your site :-) – Ian Mercer Apr 23 '10 at 21:15
4

Something quick and dirty like this might be a good start:

return if request.user_agent =~ /googlebot|msnbot|baidu|curl|wget|Mediapartners-Google|slurp|ia_archiver|Gigabot|libwww-perl|lwp-trivial/i

Note: Rails code, but the regex is generally applicable.

Brian Armstrong
  • All for quick and dirty. One caveat though: I find it's useful to revisit these types of solutions at least once a year and expand the 'dirty' list, as they tend to grow. For me this is good for numbers that only need to be 90%+ accurate. – Dave Sumter Jan 22 '13 at 10:07
0

I'm pretty sure a large proportion of bots don't use robots.txt; however, that was my first thought.

It seems to me that the best way to detect a bot is by the time between requests: if the time between requests is consistently fast, then it's a bot.
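
A minimal sketch of that idea, tracking request intervals per client IP (the threshold and hit count are arbitrary illustrations, and a real implementation would need to expire old entries):

using System;
using System.Collections.Concurrent;

public static class RequestTimingTracker
{
    private class Stats { public DateTime Last; public int FastHits; }

    private static readonly ConcurrentDictionary<string, Stats> Clients =
        new ConcurrentDictionary<string, Stats>();

    private static readonly TimeSpan FastThreshold = TimeSpan.FromMilliseconds(500);
    private const int FastHitsBeforeFlag = 10;

    // Call once per request; returns true once the client looks automated.
    public static bool RecordAndCheck(string clientIp)
    {
        DateTime now = DateTime.UtcNow;
        Stats stats = Clients.GetOrAdd(clientIp, _ => new Stats { Last = DateTime.MinValue });
        lock (stats)
        {
            if (now - stats.Last < FastThreshold)
                stats.FastHits++;
            else
                stats.FastHits = 0; // a human-like pause resets the streak
            stats.Last = now;
            return stats.FastHits >= FastHitsBeforeFlag;
        }
    }
}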

Stewart McKee
0

// Request.Browser is populated from ASP.NET's browser capability (.browser)
// definition files; the Crawler property reports whether the requesting user
// agent is a known search-engine crawler.
void CheckBrowserCaps()
{
    String labelText = "";
    System.Web.HttpBrowserCapabilities myBrowserCaps = Request.Browser;
    if (((System.Web.Configuration.HttpCapabilitiesBase)myBrowserCaps).Crawler)
    {
        labelText = "Browser is a search engine.";
    }
    else
    {
        labelText = "Browser is not a search engine.";
    }

    Label1.Text = labelText;
}

HttpCapabilitiesBase.Crawler Property

hossein