0

I was assigned this lab in which I needed to create a hash function, and count the number of collisions that occur when hashing a file ranging up to 30000 elements. Here is my code so far

#include <iostream>
#include <fstream>
#include <string>
using namespace std;

long hashcode(string s){
  long seed = 31; 
  long hash = 0;
  for(int i = 0; i < s.length(); i++){
    hash = (hash * seed) + s[i];
  }
  return hash % 10007;
};

int main(int argc, char* argv[]){
  int count = 0;
  int collisions = 0;
  fstream input(argv[1]);
  string x;
  int array[30000];

  //File stream
  while(!input.eof()){
    input>>x;
    array[count] = hashcode(x);
    count++;
    for(int i = 0; i<count; i++){
        if(array[i]==hashcode(x)){
            collisions++;
        }
    }
  }
  cout<<"Total Input is " <<count-1<<endl;
  cout<<"Collision # is "<<collisions<<endl;
}

I am just not sure of how to count the number of collisions. I tried storing every hashed value to an array and then search that array, but it resulted in like 12000 collisions when there were only 10000 elements. Any advice at all on how to count the collisions or even if my hash function could use improvement, would be appreciated. Thank you.

paul5345
  • 3
  • 1
  • 5
  • 3
    [`while (!eof())` is wrong.](http://stackoverflow.com/questions/5605125/why-is-iostreameof-inside-a-loop-condition-considered-wrong) – chris Apr 09 '17 at 15:55
  • @chris that's what my professor precoded for us though – paul5345 Apr 09 '17 at 16:02
  • See http://stackoverflow.com/questions/8317508/hash-function-for-a-string about hash function for a string. Normally a hash is used to index into the hash table. so your logic is a bit strange. – Richard Chambers Apr 09 '17 at 16:05
  • @RichardChambers His post one of the ones I used to build my hash function, my professor doesn't want them put into a hash table, he just wants them hashed and the number of collisions counted – paul5345 Apr 09 '17 at 16:07
  • That doesn't change the fact that it isn't reliable. If input fails, infinite loop. Also consider the case [where the input ends with whitespace](https://wandbox.org/permlink/pPXrjTyuOIibjhAk). As you can see, that causes undesirable behaviour as well. – chris Apr 09 '17 at 16:09
  • @paul5345: Many people find it hard to use the I/O streams library correctly. Including myself a lot of times, and your professor, as it would seem. It's just a very old library with a questionable interface, because it's so extremely easy to use incorrectly. In any case, please read the explanation for why `while(!eof())` is wrong and considering showing it to your professor so that the mistake can be fixed. – Christian Hackl Apr 09 '17 at 16:42

3 Answers3

3

The problem is you're recounting collisions (Suppose you had 4 of the same elements in your list and nothing else, and go through your algorithm to see how many collisions you'd count)

Instead, create a set of hashcodes and each time you compute a hashcode, check if it's in the set. If it's in the set, increment total number of collisions. If it's not in the set, add it to the set.

Edit:

To quickly patch your algorithm, I've done the following: incremented count after the loop and broken out of the for loop once I find a collision. This is still not super efficient since we're looping through all the results (using a set data structure would be faster) but this should at least be correct.

Also tweaked it so we don't calculate hashcode(x) over and over:

int main(int argc, char* argv[]){
  int count = 0;
  int collisions = 0;
  fstream input(argv[1]);
  string x;
  int array[30000];

  //File stream
  while(!input.eof()){
    input>>x;
    array[count] = hashcode(x);
    for(int i = 0; i<count; i++){
        if(array[i]==array[count]){
            collisions++;
            // Once we've found one collision, we don't want to count all of them.
            break;
        }
    }
    // We don't want to check our hashcode against the value we just added
    // so we should only increment count here.
    count++;
  }
  cout<<"Total Input is " <<count-1<<endl;
  cout<<"Collision # is "<<collisions<<endl;
}
muzzlator
  • 732
  • 4
  • 9
  • If i were to add them all to an array, how would I be able to search for each specific hash code after the while loop? – paul5345 Apr 09 '17 at 16:08
  • I used the term set rather than array (they have different meanings in programming), where I had intended the lookup to be done inside the while loop. Ignore all that for now: I've edited my solution to patch up your attempt rather than necessarily providing the best solution – muzzlator Apr 09 '17 at 16:11
  • I tried to use the break and it resulted in 10001 collisions, but then I saw this and changed it so it increments the counts after the loop and now I only get 2093 collisions which is a lot better. My professor's sample output results in about 4700 collisions, I saw that my hash function emits negative numbers sometimes, is that a thing that's allowed? And I think I will attempt to look into the set data structure because the program does take quite a while a run with this method, thank you a lot though. – paul5345 Apr 09 '17 at 16:15
  • If you want to speed things up a bit in the program's current form, avoid calculating hashcode(x) repeatedly in the inner loop. You can set it to a variable and calculate it only once. – muzzlator Apr 09 '17 at 16:17
  • Wow once again thank you so much, that made it 10x faster – paul5345 Apr 09 '17 at 16:22
  • Cool, also look at the references in the other answer to learn how to use sets, once you master these and understand their performance characteristics, you will significantly improve your algorithmic awareness. Re: negative hash codes, I come from java-land where it's perfectly acceptable, I'm guessing it's completely fine in C++ too – muzzlator Apr 09 '17 at 16:25
1

Answer added in the interests of education. It's probably your professor's next lesson.

Almost certainly the most efficient way to detect hash collision is to use a hash set (a.k.a. unordered_set)

#include <iostream>
#include <unordered_set>
#include <fstream>
#include <string>

// your hash algorithm
long hashcode(std::string const &s) {
    long seed = 31;
    long hash = 0;
    for (int i = 0; i < s.length(); i++) {
        hash = (hash * seed) + s[i];
    }
    return hash % 10007;
};

int main(int argc, char **argv) {
    std::ifstream is{argv[1]};
    std::unordered_set<long> seen_before;
    seen_before.reserve(10007);
    std::string buffer;
    int collisions = 0, count = 0;
    while (is >> buffer) {
        ++count;
        auto hash = hashcode(buffer);
        auto i = seen_before.find(hash);
        if (i == seen_before.end()) {
            seen_before.emplace_hint(i, hash);
        }
        else {
            ++collisions;
        }
    }
    std::cout << "Total Input is " << count << std::endl;
    std::cout << "Collision # is " << collisions << std::endl;
}
Richard Hodges
  • 64,204
  • 6
  • 75
  • 124
  • It would be even faster to fill the array with hash results, sort it afterwards and then go once through sorted array to count the number of repeating elements. – Andrey Turkin Apr 09 '17 at 16:57
  • @Andrey why? that doesn't seem faster than this – muzzlator Apr 09 '17 at 17:25
  • Btw @Richard, could we just use emplace rather than emplace_hint and look at the bool returned in the return value to decide whether to increment counter? (http://www.cplusplus.com/reference/unordered_set/unordered_set/emplace/) (Also I'm trying to understand why you did it this way, I know rather little C++ so if my suggestion has a drawback, I'd be interested in finding out) – muzzlator Apr 09 '17 at 17:33
  • @AndreyTurkin fill array is O(N), sort is O(N.logN), iterate sorted array is O(N). Inserting querying an unordered_set is constant time and emplace_hint is worst case constant time. – Richard Hodges Apr 09 '17 at 17:37
  • @muzzlator yup, that would work. I just don't like insert's semantics. – Richard Hodges Apr 09 '17 at 17:38
  • @RichardHodges Worst case for all of them is linear time, and (depending on a load factor, number of grows and on cache pressure) O(N) might be slower than O(N log N) for some N. Your code will be faster than sorting an array though, but you are not really using unordered_set. You reserve enough space to fit whole input range so hash map is reduced to a bitmap. You could've as well used `bool seen_before[10007]` or std::bitset which would take less space, less allocations and be quite a bit faster. – Andrey Turkin Apr 10 '17 at 02:37
  • I gave it a bit more though. Vector&sort solution should be able to compete with unordered_set but only in a limited case (number of inputs is reasonably small, range of hash outputs is sufficiently large to be at least comparable with number of inputs, and hash function is "good enough" to give a somewhat linear distribution) - in such conditions cost of memory allocations for hash buckets should be enough to offset cost of sorting. If number of inputs is much larger than number of outputs then unordered_set will easily beat sorting. – Andrey Turkin Apr 10 '17 at 04:46
0

For an explanation of hash tables see How does a hash table work?

#include <iostream>
#include <fstream>
#include <string>
using namespace std;

// Generate a hash code that is in the range of our hash table.
// The range we are using is zero to 10,007 so that our table is
// large enough and the prime number size reduces the probability
// of collisions from different strings hashing to the same value.
unsigned long hashcode(string s){
    unsigned long seed = 31;
    unsigned long hash = 0;
    for (int i = 0; i < s.length(); i++){
        hash = (hash * seed) + s[i];
    }
    // we want to generate a hash code that is the size of our table.
    // so we mod the calculated hash to ensure that it is in the proper range
    // of our hash table entries. 10007 is a prime number which provides
    // better characteristics than a non-prime number table size.
    return hash % 10007; 
};

int main(int argc, char * argv[]){
    int count = 0;
    int collisions = 0;
    fstream input(argv[1]);
    string x;
    int array[30000] = { 0 };

    //File stream
    while (!input.eof()){
        input >> x;     // get the next string to hash
        count++;        // count the number of strings hashed.
        // hash the string and use the hash as an index into our hash table.
        // the hash table is only used to keep a count of how many times a particular
        // hash has been generated. So the table entries are ints that start with zero.
        // If the value is greater than zero then we have a collision.
        // So we use postfix increment to check the existing value while incrementing
        // the hash table entry.
        if ((array[hashcode(x)]++) > 0)
            collisions++;
    }
    cout << "Total Input is " << count << endl;
    cout << "Collision # is " << collisions << endl;

    return 0;
}
Community
  • 1
  • 1
Richard Chambers
  • 14,509
  • 3
  • 62
  • 86