I need to detect scraping of info on my website. I tried detection based on behavior patterns, and it seems promising, although relatively compute-heavy.
The basic idea is to collect request timestamps per client and compare each client's behavior pattern against a common (or precomputed) pattern.
To be more precise, I collect the time intervals between requests into an array, indexed by a function of the interval:
i = floor(ln(interval + 1) / ln(N + 1) * N) + 1
Y[i]++
X[i]++ for the current client
where N is the interval cap (and the bin count); intervals greater than N are dropped. Initially X and Y are filled with ones.
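Here is a minimal sketch of that binning scheme in Python. The names (`bin_index`, `record`) and the value of `N` are my own assumptions, not from a real service; I also use 0-based lists, so the formula's trailing `+ 1` (for 1-based arrays) is dropped:

```python
import math

N = 32  # assumed interval cap (seconds) and bin count; the post calls this the "time (count) limit"

def bin_index(interval: float, n: int = N) -> int:
    """Map an inter-request interval to a log-scaled histogram bin (0..n)."""
    return int(math.log(interval + 1) / math.log(n + 1) * n)

# Histograms start filled with ones, as in the post (a Laplace-style smoothing).
Y = [1] * (N + 1)   # common pattern across all clients
X = {}              # per-client patterns, keyed by client id

def record(client_id: str, interval: float) -> None:
    """Update the common and per-client histograms with one observed interval."""
    if interval > N:        # intervals above the cap are dropped
        return
    i = bin_index(interval)
    Y[i] += 1
    X.setdefault(client_id, [1] * (N + 1))[i] += 1
```

The logarithmic index gives fine resolution for short intervals (where bot traffic tends to cluster) and coarse resolution for long ones.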
Then, after enough samples have accumulated in X and Y, it's time to make a decision. The criterion is the parameter C:
C = sqrt(sum((X[i]/norm(X) - Y[i]/norm(Y))^2) / k)
where X is the client's data, Y is the common data, norm() is a calibration function, and k is a normalization coefficient that depends on the type of norm(). There are 3 types:
norm(X) = sum(X)/count(X), k = 2
norm(X) = sqrt(sum(X[i]^2)), k = 2
norm(X) = max(X[i]), k = square root of the number of non-empty elements of X
C is in the range (0..1): 0 means no behavioral deviation and 1 is maximum deviation.
Calibration of type 1 works best for repeating requests, type 2 for repeating requests with few distinct intervals, and type 3 for non-constant request intervals.
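To make the comparison concrete, here is a sketch of the C score with all three calibration types. This is my reading of the formula, not a tested implementation; in particular, since the histograms start at 1, I interpret "non-empty" in type 3 as bins with a count above 1:

```python
import math

def deviation(x: list, y: list, norm_type: int = 2) -> float:
    """Deviation score C between a client histogram x and the common histogram y.

    norm_type 1: mean; 2: Euclidean (L2) norm; 3: max element.
    """
    if norm_type == 1:
        nx, ny = sum(x) / len(x), sum(y) / len(y)
        k = 2
    elif norm_type == 2:
        nx = math.sqrt(sum(v * v for v in x))
        ny = math.sqrt(sum(v * v for v in y))
        k = 2
    else:
        nx, ny = max(x), max(y)
        # "non-empty" taken as count > 1, since bins are initialized to 1 (my assumption)
        k = math.sqrt(sum(1 for v in x if v > 1) or 1)
    s = sum((xi / nx - yi / ny) ** 2 for xi, yi in zip(x, y))
    return math.sqrt(s / k)
```

With type 2, both normalized histograms are unit vectors, so the squared distance is at most 2 and dividing by k = 2 keeps C within [0, 1]; that is presumably why k = 2 was chosen there.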
What do you think? I'd appreciate it if you'd try this on your services.