let's say file A has the bytes:
2
5
8
0
33
90
1
3
200
201
23
12
55
and I have a simple hashing algorithm where I store the sum of every block of three consecutive bytes, so:
2
5
8 - = 8+5+2 = 15
0
33
90 - = 90+33+0 = 123
1
3
200 - = 204
201
23
12 - = 236
so I will be able to represent file A as 15, 123, 204, 236
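The block-sum step above can be sketched like this (a minimal sketch; the class and method names are my own):

```java
import java.util.ArrayList;
import java.util.List;

public class BlockSums {
    // Sum every non-overlapping block of 3 bytes; leftover bytes at the
    // end of the file (here the trailing 55) are ignored.
    static List<Integer> blockSums(int[] bytes) {
        List<Integer> hashes = new ArrayList<>();
        for (int i = 0; i + 3 <= bytes.length; i += 3) {
            hashes.add(bytes[i] + bytes[i + 1] + bytes[i + 2]);
        }
        return hashes;
    }

    public static void main(String[] args) {
        int[] fileA = {2, 5, 8, 0, 33, 90, 1, 3, 200, 201, 23, 12, 55};
        System.out.println(blockSums(fileA)); // prints [15, 123, 204, 236]
    }
}
```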
let's say I copy that file to a new computer B and make some minor modifications, so that the bytes of file B are:
255
2
5
8
0
33
90
1
3
200
201
23
12
255
255
(note: the difference is an extra byte at the beginning of the file and two extra bytes at the end, but the rest is identical)
so I can perform the same algorithm on file B to determine whether some parts of the two files are the same. Remember that file A was represented by the hash codes 15, 123, 204, 236
let's see if file B gives me some of those hash codes!
so on file B I will have to compute the sum over every window of 3 consecutive bytes, at every offset, since the extra byte at the start shifts the alignment
int[] sums; // array where we will hold the running (prefix) sums of the bytes
255 sums[0] = 255
2 sums[1] = 2+ sums[0] = 257
5 sums[2] = 5+ sums[1] = 262
8 sums[3] = 8+ sums[2] = 270 hash = sums[3]-sums[0] = 15 --> MATCHES FILE A!
0 sums[4] = 0+ sums[3] = 270 hash = sums[4]-sums[1] = 13
33 sums[5] = 33+ sums[4] = 303 hash = sums[5]-sums[2] = 41
90 sums[6] = 90+ sums[5] = 393 hash = sums[6]-sums[3] = 123 --> MATCHES FILE A!
1 sums[7] = 1+ sums[6] = 394 hash = sums[7]-sums[4] = 124
3 sums[8] = 3+ sums[7] = 397 hash = sums[8]-sums[5] = 94
200 sums[9] = 200+ sums[8] = 597 hash = sums[9]-sums[6] = 204 --> MATCHES FILE A!
201 sums[10] = 201+ sums[9] = 798 hash = sums[10]-sums[7] = 404
23 sums[11] = 23+ sums[10] = 821 hash = sums[11]-sums[8] = 424
12 sums[12] = 12+ sums[11] = 833 hash = sums[12]-sums[9] = 236 --> MATCHES FILE A!
55 sums[13] = 55+ sums[12] = 888 hash = sums[13]-sums[10] = 90
255 sums[14] = 255+ sums[13] = 1143 hash = sums[14]-sums[11] = 322
255 sums[15] = 255+ sums[14] = 1398 hash = sums[15]-sums[12] = 565
so from looking at that table I know that file B contains the bytes from file A plus additional ones, because the hash codes match (assuming no collisions, since two different windows could sum to the same value).
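The scan in the table above can be sketched as follows (a minimal sketch; the names are my own, and the hashes start at the fourth byte to match the table):

```java
import java.util.ArrayList;
import java.util.List;

public class RollingSum {
    // Prefix sums over the whole file: sums[i] = bytes[0] + ... + bytes[i].
    // The sum of the 3-byte window ending at i is then sums[i] - sums[i-3],
    // an O(1) operation, so the whole scan is O(n).
    static List<Integer> windowHashes(int[] bytes) {
        int[] sums = new int[bytes.length];
        List<Integer> hashes = new ArrayList<>();
        for (int i = 0; i < bytes.length; i++) {
            sums[i] = bytes[i] + (i > 0 ? sums[i - 1] : 0);
            if (i >= 3) {
                hashes.add(sums[i] - sums[i - 3]); // = bytes[i-2] + bytes[i-1] + bytes[i]
            }
        }
        return hashes;
    }

    public static void main(String[] args) {
        int[] fileB = {255, 2, 5, 8, 0, 33, 90, 1, 3, 200, 201, 23, 12, 55, 255, 255};
        // Same values as in the table: 15, 13, 41, 123, 124, 94, 204, ...
        System.out.println(windowHashes(fileB));
    }
}
```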
the reason why I show this algorithm is that it is of order n: I was able to calculate the hash of each 3-byte window without having to iterate through the window again!
If I were to use a more complex algorithm, such as taking the md5 of the last 3 bytes, then it would be of order n*k (k being the window size, here 3), because as I iterate through file B I would need an inner loop that recomputes the hash over the last k bytes at every position.
So my question is:
how can I improve the algorithm while keeping it of order n, that is, computing each hash in constant time as the window slides? If I use an existing hashing algorithm such as md5, then I will have to place an inner loop inside the scan, which will significantly increase the order of the algorithm.
Note that it is possible to do the same thing with multiplication instead of addition, but the running product grows really fast (it quickly overflows). Maybe I can combine multiplication, addition and subtraction...
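Combining multiplication and addition is exactly what a polynomial rolling hash does, and the overflow problem is handled by reducing modulo a prime (this is the Rabin-Karp idea). A sketch; the base 257 and the modulus are arbitrary choices of mine:

```java
public class PolyRolling {
    static final long BASE = 257;            // arbitrary base larger than any byte value
    static final long MOD = 1_000_000_007L;  // arbitrary large prime, keeps values small
    static final long BASE2 = BASE * BASE;   // weight of the oldest byte in the window

    // Hash of a 3-byte window: b0*BASE^2 + b1*BASE + b2 (mod MOD).
    static long hash3(int b0, int b1, int b2) {
        return (b0 * BASE2 + b1 * BASE + b2) % MOD;
    }

    // Slide the window one byte in O(1): subtract the outgoing byte's term,
    // shift the remaining terms up by one power of BASE, add the incoming byte.
    static long slide(long h, int outgoing, int incoming) {
        long withoutOld = (h - outgoing * BASE2 % MOD + MOD) % MOD;
        return (withoutOld * BASE + incoming) % MOD;
    }

    public static void main(String[] args) {
        long h = hash3(2, 5, 8);                    // window (2, 5, 8)
        long slid = slide(h, 2, 0);                 // window is now (5, 8, 0)
        System.out.println(slid == hash3(5, 8, 0)); // prints true
    }
}
```

Unlike the plain sum, the byte order now matters, so windows like (2, 5, 8) and (8, 5, 2) no longer collide.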
Edit
Also if I google for:
recursive hashing functions n-grams
a lot of information comes up and I think those algorithms are very difficult to understand...
I have to implement this algorithm for a project; that's why I am reinventing the wheel... I know there are a lot of algorithms out there.
Also, an alternative solution that I was thinking of was to perform the same algorithm plus another one that is strong: on file A I will perform the same sum algorithm over every 3-byte block plus the md5 of each block. On the second file I will only run the strong algorithm when the weak one matches...
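That two-tier idea (cheap rolling hash as a filter, strong hash only on candidate matches) is essentially what tools like rsync do. A sketch under those assumptions; all the names are my own, and the weak hash is recomputed per offset here for brevity rather than rolled:

```java
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;
import java.util.ArrayList;
import java.util.Base64;
import java.util.HashMap;
import java.util.HashSet;
import java.util.List;
import java.util.Map;
import java.util.Set;

public class TwoTier {
    // Weak hash: sum of the 3 bytes in the window (& 0xFF treats bytes as unsigned).
    static int weak(byte[] d, int off) {
        return (d[off] & 0xFF) + (d[off + 1] & 0xFF) + (d[off + 2] & 0xFF);
    }

    // Strong hash: MD5 of the 3-byte window, only computed on weak-hash hits.
    static String strong(byte[] d, int off) {
        try {
            MessageDigest md = MessageDigest.getInstance("MD5");
            md.update(d, off, 3);
            return Base64.getEncoder().encodeToString(md.digest());
        } catch (NoSuchAlgorithmException e) {
            throw new IllegalStateException(e); // MD5 is always available on the JVM
        }
    }

    // Index file A's blocks: weak hash -> strong hashes of blocks with that weak hash.
    static Map<Integer, Set<String>> index(byte[] fileA) {
        Map<Integer, Set<String>> idx = new HashMap<>();
        for (int i = 0; i + 3 <= fileA.length; i += 3) {
            idx.computeIfAbsent(weak(fileA, i), k -> new HashSet<>()).add(strong(fileA, i));
        }
        return idx;
    }

    // Scan file B at every offset; MD5 runs only where the cheap weak hash matches.
    static List<Integer> matches(byte[] fileB, Map<Integer, Set<String>> idx) {
        List<Integer> found = new ArrayList<>();
        for (int i = 0; i + 3 <= fileB.length; i++) {
            Set<String> strongs = idx.get(weak(fileB, i)); // O(1) filter
            if (strongs != null && strongs.contains(strong(fileB, i))) {
                found.add(i); // offset in B where a block of A appears
            }
        }
        return found;
    }

    public static void main(String[] args) {
        byte[] fileA = {2, 5, 8, 0, 33, 90, 1, 3, (byte) 200, (byte) 201, 23, 12, 55};
        byte[] fileB = {(byte) 255, 2, 5, 8, 0, 33, 90, 1, 3, (byte) 200,
                        (byte) 201, 23, 12, 55, (byte) 255, (byte) 255};
        System.out.println(matches(fileB, index(fileA))); // prints [1, 4, 7, 10]
    }
}
```

The weak hash is collision-prone but nearly free, so the expensive md5 runs only on the handful of offsets that survive the filter, keeping the scan close to O(n) in practice.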