let's say file A has the bytes:
2
5
8
0
33
90
1
3
200
201
23
12
55
and I have a simple hashing algorithm where I store the sum of every block of three consecutive bytes, so:
2
5
8 - = 8+5+2 = 15
0
33
90 - = 90+33+0 = 123
1
3
200 - = 204
201
23
12 - = 236
so I will be able to represent file A as 15, 123, 204, 236
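The block-sum step above can be sketched like this (a minimal sketch; the class and method names are my own):

```java
import java.util.ArrayList;
import java.util.List;

public class BlockSums {
    // Sum every non-overlapping block of 3 bytes; leftover bytes at the
    // end of the file (here the trailing 55) are ignored.
    static List<Integer> blockSums(int[] bytes) {
        List<Integer> hashes = new ArrayList<>();
        for (int i = 0; i + 3 <= bytes.length; i += 3) {
            hashes.add(bytes[i] + bytes[i + 1] + bytes[i + 2]);
        }
        return hashes;
    }

    public static void main(String[] args) {
        int[] fileA = {2, 5, 8, 0, 33, 90, 1, 3, 200, 201, 23, 12, 55};
        System.out.println(blockSums(fileA)); // prints [15, 123, 204, 236]
    }
}
```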
let's say I copy that file to a new computer B and make some minor modifications, so that the bytes of file B are:
255
2
5
8
0
33
90
1
3
200
201
23
12
255
255
(note: the difference is an extra byte at the beginning of the file and two extra bytes at the end, but the rest is identical)
so I can perform the same algorithm on file B to determine whether some parts of the two files are the same. Remember that file A was represented by the hash codes 15, 123, 204, 236
let's see if file B gives me some of those hash codes!
so on file B I will have to compute the sum over every window of 3 consecutive bytes, at every offset, since the extra byte at the start shifts the alignment
int[] sums; // array where we will hold the running (prefix) sums of the bytes
255 sums[0] = 255
2 sums[1] = 2+ sums[0] = 257
5 sums[2] = 5+ sums[1] = 262
8 sums[3] = 8+ sums[2] = 270 hash = sums[3]-sums[0] = 15 --> MATCHES FILE A!
0 sums[4] = 0+ sums[3] = 270 hash = sums[4]-sums[1] = 13
33 sums[5] = 33+ sums[4] = 303 hash = sums[5]-sums[2] = 41
90 sums[6] = 90+ sums[5] = 393 hash = sums[6]-sums[3] = 123 --> MATCHES FILE A!
1 sums[7] = 1+ sums[6] = 394 hash = sums[7]-sums[4] = 124
3 sums[8] = 3+ sums[7] = 397 hash = sums[8]-sums[5] = 94
200 sums[9] = 200+ sums[8] = 597 hash = sums[9]-sums[6] = 204 --> MATCHES FILE A!
201 sums[10] = 201+ sums[9] = 798 hash = sums[10]-sums[7] = 404
23 sums[11] = 23+ sums[10] = 821 hash = sums[11]-sums[8] = 424
12 sums[12] = 12+ sums[11] = 833 hash = sums[12]-sums[9] = 236 --> MATCHES FILE A!
55 sums[13] = 55+ sums[12] = 888 hash = sums[13]-sums[10] = 90
255 sums[14] = 255+ sums[13] = 1143 hash = sums[14]-sums[11] = 322
255 sums[15] = 255+ sums[14] = 1398 hash = sums[15]-sums[12] = 565
so from looking at that table I know that file B contains the bytes from file A plus additional ones, because the hash codes match (assuming no collisions, since two different windows could sum to the same value).
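The scan in the table above can be sketched as follows (a minimal sketch; the names are my own, and the hashes start at the fourth byte to match the table):

```java
import java.util.ArrayList;
import java.util.List;

public class RollingSum {
    // Prefix sums over the whole file: sums[i] = bytes[0] + ... + bytes[i].
    // The sum of the 3-byte window ending at i is then sums[i] - sums[i-3],
    // an O(1) operation, so the whole scan is O(n).
    static List<Integer> windowHashes(int[] bytes) {
        int[] sums = new int[bytes.length];
        List<Integer> hashes = new ArrayList<>();
        for (int i = 0; i < bytes.length; i++) {
            sums[i] = bytes[i] + (i > 0 ? sums[i - 1] : 0);
            if (i >= 3) {
                hashes.add(sums[i] - sums[i - 3]); // = bytes[i-2] + bytes[i-1] + bytes[i]
            }
        }
        return hashes;
    }

    public static void main(String[] args) {
        int[] fileB = {255, 2, 5, 8, 0, 33, 90, 1, 3, 200, 201, 23, 12, 55, 255, 255};
        // Same values as in the table: 15, 13, 41, 123, 124, 94, 204, ...
        System.out.println(windowHashes(fileB));
    }
}
```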
the reason why I show this algorithm is that it is of order n: I was able to calculate the hash of each 3-byte window without having to iterate through the window again!
If I were to use a more complex algorithm, such as taking the md5 of the last 3 bytes, then it would be of order n*k (k being the window size, here 3), because as I iterate through file B I would need an inner loop that recomputes the hash over the last k bytes at every position.
So my question is:
how can I improve the algorithm while keeping it of order n, that is, computing each hash in constant time as the window slides? If I use an existing hashing algorithm such as md5, then I will have to place an inner loop inside the scan, which will significantly increase the order of the algorithm.
Note that it is possible to do the same thing with multiplication instead of addition, but the running product grows really fast (it quickly overflows). Maybe I can combine multiplication, addition and subtraction...
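Combining multiplication and addition is exactly what a polynomial rolling hash does, and the overflow problem is handled by reducing modulo a prime (this is the Rabin-Karp idea). A sketch; the base 257 and the modulus are arbitrary choices of mine:

```java
public class PolyRolling {
    static final long BASE = 257;            // arbitrary base larger than any byte value
    static final long MOD = 1_000_000_007L;  // arbitrary large prime, keeps values small
    static final long BASE2 = BASE * BASE;   // weight of the oldest byte in the window

    // Hash of a 3-byte window: b0*BASE^2 + b1*BASE + b2 (mod MOD).
    static long hash3(int b0, int b1, int b2) {
        return (b0 * BASE2 + b1 * BASE + b2) % MOD;
    }

    // Slide the window one byte in O(1): subtract the outgoing byte's term,
    // shift the remaining terms up by one power of BASE, add the incoming byte.
    static long slide(long h, int outgoing, int incoming) {
        long withoutOld = (h - outgoing * BASE2 % MOD + MOD) % MOD;
        return (withoutOld * BASE + incoming) % MOD;
    }

    public static void main(String[] args) {
        long h = hash3(2, 5, 8);                    // window (2, 5, 8)
        long slid = slide(h, 2, 0);                 // window is now (5, 8, 0)
        System.out.println(slid == hash3(5, 8, 0)); // prints true
    }
}
```

Unlike the plain sum, the byte order now matters, so windows like (2, 5, 8) and (8, 5, 2) no longer collide.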
Edit
Also if I google for:
recursive hashing functions n-grams
a lot of information comes up and I think those algorithms are very difficult to understand...
I have to implement this algorithm for a project; that's why I am reinventing the wheel... I know there are a lot of algorithms out there.
Also, an alternative solution that I was thinking of was to perform the same algorithm plus another one that is strong: on file A I will perform the same sum algorithm over every 3-byte block plus the md5 of each block. On the second file I will only run the strong algorithm when the weak one matches...
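That two-tier idea (cheap rolling hash as a filter, strong hash only on candidate matches) is essentially what tools like rsync do. A sketch under those assumptions; all the names are my own, and the weak hash is recomputed per offset here for brevity rather than rolled:

```java
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;
import java.util.ArrayList;
import java.util.Base64;
import java.util.HashMap;
import java.util.HashSet;
import java.util.List;
import java.util.Map;
import java.util.Set;

public class TwoTier {
    // Weak hash: sum of the 3 bytes in the window (& 0xFF treats bytes as unsigned).
    static int weak(byte[] d, int off) {
        return (d[off] & 0xFF) + (d[off + 1] & 0xFF) + (d[off + 2] & 0xFF);
    }

    // Strong hash: MD5 of the 3-byte window, only computed on weak-hash hits.
    static String strong(byte[] d, int off) {
        try {
            MessageDigest md = MessageDigest.getInstance("MD5");
            md.update(d, off, 3);
            return Base64.getEncoder().encodeToString(md.digest());
        } catch (NoSuchAlgorithmException e) {
            throw new IllegalStateException(e); // MD5 is always available on the JVM
        }
    }

    // Index file A's blocks: weak hash -> strong hashes of blocks with that weak hash.
    static Map<Integer, Set<String>> index(byte[] fileA) {
        Map<Integer, Set<String>> idx = new HashMap<>();
        for (int i = 0; i + 3 <= fileA.length; i += 3) {
            idx.computeIfAbsent(weak(fileA, i), k -> new HashSet<>()).add(strong(fileA, i));
        }
        return idx;
    }

    // Scan file B at every offset; MD5 runs only where the cheap weak hash matches.
    static List<Integer> matches(byte[] fileB, Map<Integer, Set<String>> idx) {
        List<Integer> found = new ArrayList<>();
        for (int i = 0; i + 3 <= fileB.length; i++) {
            Set<String> strongs = idx.get(weak(fileB, i)); // O(1) filter
            if (strongs != null && strongs.contains(strong(fileB, i))) {
                found.add(i); // offset in B where a block of A appears
            }
        }
        return found;
    }

    public static void main(String[] args) {
        byte[] fileA = {2, 5, 8, 0, 33, 90, 1, 3, (byte) 200, (byte) 201, 23, 12, 55};
        byte[] fileB = {(byte) 255, 2, 5, 8, 0, 33, 90, 1, 3, (byte) 200,
                        (byte) 201, 23, 12, 55, (byte) 255, (byte) 255};
        System.out.println(matches(fileB, index(fileA))); // prints [1, 4, 7, 10]
    }
}
```

The weak hash is collision-prone but nearly free, so the expensive md5 runs only on the handful of offsets that survive the filter, keeping the scan close to O(n) in practice.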