11

I need to detect scraping of information on my website. I've tried detection based on behavior patterns, and it seems promising, although it is relatively computationally heavy.

The basic idea is to collect the request timestamps of a given client and compare that client's behavior pattern against the common pattern (or a precomputed one).

To be more precise, I collect the time intervals between requests into an array, indexed by a function of the interval:

i = floor( ln(interval + 1) / ln(N + 1) * N ) + 1
Y[i]++
X[i]++   (for the current client)

where N is the interval limit (which also serves as the number of buckets); intervals greater than N are dropped. Initially, X and Y are filled with ones.
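
Roughly, the update step could look like this (a Python sketch of the description above; the value of N and the per-client dictionary are just illustrative, and the arrays are sized N + 2 because the index reaches N + 1 when interval equals N exactly):

    import math

    N = 32                                   # interval/bucket limit, value chosen only for illustration
    Y = [1] * (N + 2)                        # common data; index 0 unused, buckets run 1..N+1
    clients = {}                             # client id -> its personal histogram X

    def bucket(interval):
        # log-scale bucket index: i = floor(ln(interval + 1) / ln(N + 1) * N) + 1
        return int(math.log(interval + 1) / math.log(N + 1) * N) + 1

    def record(client_id, interval):
        if interval > N:                     # intervals greater than N are dropped
            return
        X = clients.setdefault(client_id, [1] * (N + 2))   # X starts filled with ones
        i = bucket(interval)
        Y[i] += 1
        X[i] += 1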

Then, once enough samples have accumulated in X and Y, it's time to make a decision. The criterion is the parameter C:

C = sqrt( sum( (X[i]/norm(X) - Y[i]/norm(Y))^2 ) / k )

where X is the data of the client in question, Y is the common data, norm() is a calibration function, and k is a normalization coefficient that depends on the type of norm(). There are 3 types:

  1. norm(X) = sum(X) / count(X), k = 2
  2. norm(X) = sqrt( sum(X[i]^2) ), k = 2
  3. norm(X) = max(X[i]), k = square root of the number of non-empty elements of X

C lies in the range (0..1): 0 means no behavior deviation and 1 means maximum deviation.

Calibration of type 1 works best for repeating requests, type 2 for repeating requests with only a few distinct intervals, and type 3 for non-constant request intervals.
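
And a sketch of the decision step (Python again; for type 3 I read "non-empty elements" as buckets that have been incremented at least once, i.e. value > 1, which is my own assumption):

    import math

    def compute_C(X, Y, calibration=1):
        # deviation C between a client's histogram X and the common histogram Y
        def norm(V):
            if calibration == 1:             # type 1: sum(V) / count(V)
                return sum(V) / len(V)
            if calibration == 2:             # type 2: sqrt(sum(V[i]^2))
                return math.sqrt(sum(v * v for v in V))
            return max(V)                    # type 3: max(V[i])

        if calibration == 3:                 # k = sqrt(number of non-empty buckets of X)
            k = math.sqrt(sum(1 for x in X if x > 1) or 1)
        else:
            k = 2

        nx, ny = norm(X), norm(Y)
        s = sum((x / nx - y / ny) ** 2 for x, y in zip(X, Y))
        return math.sqrt(s / k)

With the arrays from the earlier sketch, the call would be compute_C(clients[some_id][1:], Y[1:], calibration=1), skipping the unused slot 0.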

What do you think? I'd appreciate it if you'd try this on your services.

rook
aks
  • 2
    I just have to say: scraping will always exist. In the future you should at least consider a business model suited for the 21st century. – rook Mar 20 '11 at 23:54

4 Answers

10

To be honest, your approach is completely worthless because it is trivial to bypass. An attacker doesn't even have to write a line of code to get around it: proxy servers are free, and you can boot up a new machine with a new IP address on Amazon EC2 for 2 cents an hour.

A better approach is Roboo, which uses cookie techniques to foil robots. The vast majority of robots can't run JavaScript or Flash, and this can be used to your advantage.
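
The general idea is roughly this (a Python/Flask sketch of the cookie-challenge concept, not Roboo's actual implementation; a real deployment would use a signed, expiring token instead of a forgeable constant):

    from flask import Flask, request, make_response

    app = Flask(__name__)

    CHALLENGE = """<html><body><script>
      document.cookie = "js_ok=1; path=/";
      location.reload();
    </script></body></html>"""

    @app.before_request
    def require_js_cookie():
        # Clients that never execute the script never obtain the cookie,
        # so they keep receiving the challenge page instead of content.
        if request.cookies.get("js_ok") != "1":
            return make_response(CHALLENGE, 200)

    @app.route("/")
    def index():
        return "Real content, reachable only after the script has run."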

However, all of this is "(in)security through obscurity", and the ONLY REASON it might work is that your data isn't worth a programmer spending 5 minutes on it (Roboo included).

rook
3

If you are asking specifically about the validity of your algorithm: it isn't bad, but it seems like you are overcomplicating it. You should use the basic methodologies already employed by WAFs to rate limit connections. One such algorithm that already exists is the Leaky Bucket algorithm (http://en.wikipedia.org/wiki/Leaky_bucket).
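
For reference, a minimal leaky bucket limiter might look something like this (Python sketch; the capacity and leak rate are arbitrary example values):

    import time

    class LeakyBucket:
        # Each request adds one unit; the bucket drains at a fixed rate.
        # A request is rejected when the bucket would overflow.
        def __init__(self, capacity=20, leak_per_sec=1.0):
            self.capacity = capacity
            self.leak_per_sec = leak_per_sec
            self.level = 0.0
            self.last = time.monotonic()

        def allow(self):
            now = time.monotonic()
            # Drain according to the time elapsed since the last check.
            self.level = max(0.0, self.level - (now - self.last) * self.leak_per_sec)
            self.last = now
            if self.level + 1 > self.capacity:
                return False                 # bucket full: rate limit exceeded
            self.level += 1
            return True

    # One bucket per client IP (illustrative; real WAFs track this more carefully).
    buckets = {}
    def check(ip):
        return buckets.setdefault(ip, LeakyBucket()).allow()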

As far as rate limiting to stop web scraping goes, there are two flaws in trying to rate limit connections. The first is people's ability to use proxy networks or Tor to anonymize each request, which essentially nullifies your efforts. Even off-the-shelf scraping software like http://www.mozenda.com uses a huge block of IPs and rotates through them to solve this problem. The other issue is that you could potentially block people using a shared IP: companies and universities often use NATs, and your algorithm could mistake many of their users for one person.

For full disclosure, I am a co-founder of Distil Networks, and we often poke holes in WAF features like rate limiting. We pitch that a more comprehensive solution is required, hence the need for our service.

Rami
3

I do a lot of web scraping and always use multiple IP addresses and random intervals between each request.

When scraping a page I typically download only the HTML and not the dependencies (images, CSS, etc.). So you could try checking whether the user downloads these dependencies.
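
A crude version of that check could run over the access log, something like this (Python sketch; the asset extensions and page threshold are arbitrary example values):

    from collections import defaultdict

    ASSET_EXTS = (".css", ".js", ".png", ".jpg", ".gif", ".ico", ".woff")

    def suspicious_clients(records, min_pages=5):
        # records: iterable of (client_ip, request_path) taken from the access log
        pages = defaultdict(int)
        assets = defaultdict(int)
        for ip, path in records:
            if path.lower().endswith(ASSET_EXTS):
                assets[ip] += 1
            else:
                pages[ip] += 1
        # clients that fetched several pages but never a single dependency
        return [ip for ip, n in pages.items() if n >= min_pages and assets[ip] == 0]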

hoju
  • 1
    It's the easiest detection method, along with cookie checking, and it's obvious to implement. Here I try to spot scraping by anomalies in user activity. This may produce false alarms, but in any case the user was doing something strange. – aks Mar 25 '11 at 08:31
  • That may not work in all cases because a lot of browsers can be configured not to download any dependencies unless the user clicks on them (e.g., with ad blockers, Flash blockers, etc.). Text browsers may not download certain dependencies either. – gonzobrains Jul 16 '12 at 17:52
0

OK, someone could build a robot that enters your website, downloads the HTML (not the images, CSS, etc., as in @hoju's response), and builds a graph of the links to be traversed on your site.

The robot could use random timings between requests and change the IP on each of them using a proxy, a VPN, Tor, etc.

I was tempted to answer that you could try to trick the robot by adding hidden links using CSS (a common solution found on the Internet), but it is not a real solution. When the robot accesses a forbidden link you can block that IP, but you would end up with a huge list of banned IPs. Also, if someone started spoofing IPs and making requests to that link on your server, you could end up isolated from the world. Apart from anything else, the robot could likely be adapted to recognize and skip the hidden links anyway.

A more effective way, I think, would be to check the IP of each incoming request against an API that detects proxies, VPNs, Tor, etc. I searched Google for "api detection vpn proxy tor" and found some (paid) services; maybe there are free ones.

If the API response is positive, forward the request to a CAPTCHA.
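
Sketched out, the flow might look like this (Python/Flask; the lookup URL and JSON fields are placeholders for whichever detection service you pick, not a real API):

    import json
    import urllib.request
    from flask import Flask, request, redirect

    app = Flask(__name__)

    def is_proxy_vpn_or_tor(ip):
        # Placeholder endpoint and response format; substitute the real service here.
        url = "https://ip-check.example.com/lookup?ip=" + ip
        try:
            with urllib.request.urlopen(url, timeout=2) as resp:
                data = json.load(resp)
            return bool(data.get("proxy") or data.get("vpn") or data.get("tor"))
        except OSError:
            return False                     # fail open if the lookup is unavailable

    @app.before_request
    def screen_request():
        if request.path != "/captcha" and is_proxy_vpn_or_tor(request.remote_addr):
            return redirect("/captcha")      # positive result: send the client to a CAPTCHA
            # (a /captcha route serving the challenge itself is assumed to exist)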

user9869932