4

I have a few questions about an assignment that I need to do. It might seem that I'm just looking for the code, but what I'm really trying to do is learn, because after weeks of searching for information I'm lost. I'm really new at C.

Here is the assignment :

  • Given 3 files (foo.txt, bar.txt, foo2.txt), each with a different number of words (I need to use dynamic memory).

Create a program that asks for a word and tells you whether that word is in any of the documents (the result is the name of each document where it appears).

Example :

  • Please enter a word: dog
  • "dog" is in foo.txt and bar.txt

(I guess I need to load the 3 files and create a hash table that has the key values for every word in the documents, plus something that tells you which document each word is in.)

I guess I need to implement:

  • A hash function that converts a word into a hash value.
  • A hash table that stores the hash value of every word (but I think I should also store the document index?).
  • Use of dynamic allocation.
  • Check for collisions while I'm inserting values into the hash table (using quadratic probing and also chaining).
  • I also need to know how many times the word I'm looking for appears in the text.

I've been searching for hash map implementations, hash tables, quadratic probing, hash functions for strings... but my head is a mess right now and I don't really know where I should start.

So far I've read:

Algorithm to get a list of all words that are anagrams of all substrings (scrabble)?

Implementing with quadratic probing

Does C have hash/dictionary data structure?

https://gist.github.com/tonious/1377667

hash function for string

http://www.cs.yale.edu/homes/aspnes/pinewiki/C(2f)HashTables.html?highlight=(CategoryAlgorithmNotes)

https://codereview.stackexchange.com/questions/115843/dictionary-implementation-using-hash-table-in-c

Sorry in advance for my English.

Hope you can help me.

Thanks.

FIRST EDIT

  • Thanks for the quick responses.
  • I'm trying to put it all together and try something. However, @Shasha99, I cannot use the TRIE data structure; I'm checking the links you gave me.
  • @MichaelDorgan Thanks for posting a solution for beginners. However, I must use hashing (it's for an Algorithms and Structures class), and the teacher told us we MUST implement a hash function, a hash table, and probably another structure that stores important information.

After thinking for an hour I tried the following :

  • A Structure that stores the word, the number of documents where it appears and the index of those documents.
    typedef struct WordMetadata {
        char* Word;
        int Documents[5];
        int DocumentsCount;
    } WordMetadata;
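  • A function that initializes that structure
       WordMetadata **InitTable (void) {
            /* One pointer per slot. The table is returned, because assigning
               to a by-value parameter would not reach the caller. */
            WordMetadata **Table = malloc (sizeof(WordMetadata*) * TABLESIZE);
            for (int i = 0; Table != NULL && i < TABLESIZE; i++) {
                Table[i] = NULL;
            }
            return Table;
        }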
  • A function that loads the 3 documents into memory and indexes every word inside the hash table.

  • A function that indexes a word in the mentioned structure.

  • A function that searches for a specific word using quadratic probing (if I solve this I will try the chaining one...).

  • A function that calculates the hash value of a word (I think I will use djb2 or one of the ones I found here http://www.cse.yorku.ca/~oz/hash.html), but for now:

 int Hash (char *WordParam) {

            int i = 0;  /* declared outside the loop so it is still in scope at the return */

            while (*WordParam != '\0') {

                i += *WordParam++;

            }

            return (i % TABLESIZE);
 }
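
For comparison, here is a minimal sketch of the djb2 hash from the page linked above. It assumes the same TABLESIZE of 4001 that the later code uses; the name HashDjb2 is made up for illustration. Bytes are read as unsigned char so that accented letters (bytes above 127) cannot push the result negative on platforms where char is signed.

```c
#include <assert.h>
#include <stddef.h>

#define TABLESIZE 4001   /* assumption: same prime table size as in the question */

/* djb2 string hash, reduced to a table index in [0, TABLESIZE). */
unsigned long HashDjb2 (const char *Word) {
    assert(Word != NULL);
    unsigned long HashValue = 5381;
    /* Read each byte as unsigned char so bytes above 127 stay positive. */
    for (const unsigned char *p = (const unsigned char *) Word; *p != '\0'; p++) {
        HashValue = HashValue * 33 + *p;   /* unsigned overflow wraps, which is fine here */
    }
    return HashValue % TABLESIZE;
}
```

Unlike a plain sum of characters, this mixes in the position of each character, so anagrams such as "dog" and "god" usually land in different slots.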

EDIT 2

I tried to implement something. It's not working, but could you take a look and tell me what is wrong? (I know the code is a mess.)

EDIT 3

This code compiles and runs properly; however, some words are not found (maybe they are not indexed, I don't know). I'm thinking about moving to another hash function, as I mentioned in my first message.

  • Approximately 85% of the words from every text file (~200 words each) are correctly found by the program.

  • The others are random words that I think are not indexed correctly, or maybe I have an error in my search function...

Here is the current (Fully functional) code:

#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#define TABLESIZE 4001
#define LINESIZE 2048
#define DELIMITER " \t"

typedef struct TTable {
    char*   Word;               /* The actual word  */
    int     Documents[5];           /* Documents Index */
    int     DocumentsCount;             /* Number of documents where the word exist */
} TTable;


int Hash (char *Word);
void Index (TTable **HashTable, char* Word, int DocumentIndex);
int Search (TTable **HashTable, char* Word);
int mystrcmp(char *s1, char *s2);
char* Documents[] = {"foo.txt","bar.txt","foo2.txt",NULL};


int main() {

    FILE* file;
    TTable **HashTable;
    int DocumentIndex;
    char Line[LINESIZE];
    char* Word;
    char* Tmp;

    HashTable = (TTable**) malloc (sizeof(TTable*)*TABLESIZE); /* the table holds pointers, one per slot */
    for (int i = 0; i < TABLESIZE; i++) {
      HashTable[i] = (TTable*) NULL;
    }

    for (DocumentIndex = 0; Documents[DocumentIndex] != NULL; DocumentIndex++) {

      file = fopen(Documents[DocumentIndex],"r");
      if (file == NULL) {

          fprintf(stderr, "Error opening %s\n", Documents[DocumentIndex]);
          continue;

      }


      while (fgets (Line,LINESIZE,file) != NULL) {

          Line[LINESIZE-1] = '\0';
          Tmp = strtok (Line,DELIMITER);

          /* A line made only of delimiters yields no token, so test Tmp before using it. */
          while (Tmp != NULL) {

              Word = (char*) malloc (strlen(Tmp)+1);
              strcpy(Word,Tmp);
              Index(HashTable,Word,DocumentIndex);
              Tmp = strtok(NULL,DELIMITER);
          }

      }

        fclose(file);

    }


        printf("Enter the word:");
        fgets(Line,100,stdin);
        Line[strlen(Line)-1]='\0'; //fgets stores newline as well. so removing newline.
        int i = Search(HashTable,Line);
        if (i != -1) {
          for (int j = 0; j < HashTable[i]->DocumentsCount; j++) {
            printf("%s\n", Documents[HashTable[i]->Documents[j]]);
            if ( j < HashTable[i]->DocumentsCount-1) {

                printf(",");
            }
          }
        }

        else {
          printf("Cant find word\n");
        }


        for (i = 0; i < TABLESIZE; i++) {
          if (HashTable[i] != NULL) {

              free(HashTable[i]->Word);
              free(HashTable[i]);

          }
        }


return 0;
}

/* Theorem: If TableSize is prime and the load factor is < 0.5, quadratic
probing will always find an empty slot
*/
int Search (TTable **HashTable, char* Word) {

    int Aux = Hash(Word);
    int OldPosition,ActualPosition;

    ActualPosition = -1;

    for (int i = 0; i < TABLESIZE; i++) {
      OldPosition = ActualPosition;
      ActualPosition = (Aux + i*i) % TABLESIZE;

      if (HashTable[ActualPosition] == NULL) {
        return -1;
      }

      if (strcmp(Word,HashTable[ActualPosition]->Word) == 0) {
        return ActualPosition;
      }
    }

    return -1; // Word not found
}


void Index (TTable **HashTable, char* Word, int DocumentIndex) {

    int Aux; //Hash value
    int OldPosition, ActualPosition;

    if ((ActualPosition = Search(HashTable,Word)) != -1) {

        for (int j = 0; j < HashTable[ActualPosition]->DocumentsCount;j++) {

            if(HashTable[ActualPosition]->Documents[j] == DocumentIndex) {
              return;
            }

        }

        HashTable[ActualPosition]->Documents[HashTable[ActualPosition]->DocumentsCount] = DocumentIndex;
        HashTable[ActualPosition]->DocumentsCount++;
        return;
    }

    ActualPosition = -1;
    Aux = Hash(Word);

    for (int i = 0; i < TABLESIZE; i++) {

        OldPosition = ActualPosition;
        ActualPosition = (Aux + i*i) % TABLESIZE;
        if (OldPosition == ActualPosition) {
          break;
        }

        if (HashTable[ActualPosition] == NULL) {

            HashTable[ActualPosition] = (TTable*)malloc (sizeof(TTable));
            HashTable[ActualPosition]->Word = Word;
            HashTable[ActualPosition]->Documents[0] = DocumentIndex;
            HashTable[ActualPosition]->DocumentsCount = 1;
            return;
        }

    }

    printf("No more free space\n");

}


int Hash (char *Word) {

    int HashValue;
    for (HashValue = 0; *Word != '\0';) {
      HashValue += *Word++;
    }

    return (HashValue % TABLESIZE);
}
ODB8
  • Do you need to show where the word is at? It sounds like from the question, all you have to do is determine if the word exists in the file or not. You should be able to do a simple character search of the file and if you have a match, then return the name of the file. – Daniel Congrove Nov 15 '16 at 19:24
  • What problem are you facing now ? Where is your code failing ? – Shasha99 Nov 16 '16 at 18:43
  • Edited my answer and shown the possible problems in your code. Let me know if it works !!! – Shasha99 Nov 17 '16 at 07:02

2 Answers

1

I would suggest using a TRIE data structure to store the strings from all three files in memory, as a hash table would consume more space. As a first step, read all three files one by one, and for each word in file_i do the following:

  1. If the word is already present in the TRIE, append the file index to that node or update the word count for that particular file. You may need 3 variables at each node, one each for file1, file2 and file3, to store the word counts.
  2. If the word is not present, add the word and the file index to a TRIE node.

Once you are done building your TRIE, checking whether a word is present takes time proportional to the length of the word, independent of how many words are stored (effectively O(1) with respect to the number of words).
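
The two steps above could be sketched roughly as follows. This is only an illustration under stated assumptions: words contain only the lowercase letters 'a' to 'z', there are exactly 3 files, and the names TrieNode, TrieNew, TrieAdd and TrieCounts are invented for this example.

```c
#include <assert.h>
#include <stdlib.h>

#define ALPHABET 26   /* assumption: words use only 'a'..'z' */
#define NFILES   3    /* file1, file2 and file3 */

typedef struct TrieNode {
    struct TrieNode *Child[ALPHABET];
    int Counts[NFILES];   /* occurrences of the word ending at this node, per file */
} TrieNode;

/* calloc zeroes the counts and the child pointers. */
TrieNode *TrieNew (void) {
    return calloc(1, sizeof(TrieNode));
}

/* Record one occurrence of Word in file FileIndex.
   Returns 0 on success, -1 on invalid input or out of memory. */
int TrieAdd (TrieNode *Root, const char *Word, int FileIndex) {
    assert(Root != NULL && Word != NULL);
    if (FileIndex < 0 || FileIndex >= NFILES) return -1;
    TrieNode *n = Root;
    for (; *Word != '\0'; Word++) {
        if (*Word < 'a' || *Word > 'z') return -1;
        int c = *Word - 'a';
        if (n->Child[c] == NULL && (n->Child[c] = TrieNew()) == NULL) return -1;
        n = n->Child[c];
    }
    n->Counts[FileIndex]++;
    return 0;
}

/* Return the per-file counts for Word, or NULL if no such path exists.
   A path that is only a prefix of stored words has all counts equal to 0. */
const int *TrieCounts (const TrieNode *Root, const char *Word) {
    assert(Root != NULL && Word != NULL);
    const TrieNode *n = Root;
    for (; *Word != '\0'; Word++) {
        if (*Word < 'a' || *Word > 'z') return NULL;
        n = n->Child[*Word - 'a'];
        if (n == NULL) return NULL;
    }
    return n->Counts;
}
```

A word is in file i when TrieCounts returns a non-NULL array whose entry i is greater than 0. Freeing the nodes is left out to keep the sketch short.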


If you are going with Hash Tables, then:

  1. You should start with how to get hash values for strings.
  2. Then read about open addressing, probing and chaining
  3. Then understand the problems in open addressing and chaining approaches.
  4. How will you delete an element in a hash table with open addressing and probing? here
  5. How will the search be performed in case of chaining? here
  6. Making a dynamic hash table with open addressing? Amortized analysis here and here.
  7. Comparing chaining and open addressing. here.
  8. Think about how these problems can be resolved. Maybe TRIE?
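
To make the chaining option in the list above concrete, here is a minimal sketch. The bucket count, the per-document counters and the names ChainNode, BucketOf, ChainFind and ChainAdd are all assumptions made for this example, not code from the question.

```c
#include <assert.h>
#include <stdlib.h>
#include <string.h>

#define NBUCKETS 4001   /* hypothetical bucket count */
#define NDOCS    3      /* foo.txt, bar.txt, foo2.txt */

/* One node per distinct word; colliding words share a bucket's linked list. */
typedef struct ChainNode {
    char *Word;
    int Counts[NDOCS];          /* occurrences of Word in each document */
    struct ChainNode *Next;
} ChainNode;

/* djb2-style hash reduced to a bucket index. */
static unsigned long BucketOf (const char *Word) {
    unsigned long h = 5381;
    for (const unsigned char *p = (const unsigned char *) Word; *p != '\0'; p++)
        h = h * 33 + *p;
    return h % NBUCKETS;
}

/* Search: walk the bucket's list and return the node for Word, or NULL. */
ChainNode *ChainFind (ChainNode **Buckets, const char *Word) {
    assert(Buckets != NULL && Word != NULL);
    for (ChainNode *n = Buckets[BucketOf(Word)]; n != NULL; n = n->Next)
        if (strcmp(n->Word, Word) == 0)
            return n;
    return NULL;
}

/* Record one occurrence of Word in document Doc.
   Returns 0 on success, -1 if memory runs out. */
int ChainAdd (ChainNode **Buckets, const char *Word, int Doc) {
    assert(Doc >= 0 && Doc < NDOCS);
    ChainNode *n = ChainFind(Buckets, Word);
    if (n == NULL) {
        n = calloc(1, sizeof *n);            /* counts start at 0 */
        if (n == NULL) return -1;
        n->Word = malloc(strlen(Word) + 1);  /* the node owns its own copy */
        if (n->Word == NULL) { free(n); return -1; }
        strcpy(n->Word, Word);
        unsigned long b = BucketOf(Word);
        n->Next = Buckets[b];                /* push onto the front of the chain */
        Buckets[b] = n;
    }
    n->Counts[Doc]++;
    return 0;
}
```

The table itself would be an array of NBUCKETS list heads, for example allocated with calloc so that every bucket starts empty. With chaining a collision never fails to find room, at the cost of one extra pointer per stored word.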


Problem in the code of your EDIT 2:

Outstanding progress on your side!

After a quick look, I found the following problems:

Don't use the gets() function; use fgets() instead. So replace:

gets(Line);

with the following:

fgets(Line,100,stdin);
Line[strlen(Line)-1]='\0'; //fgets stores newline as well. so removing newline.

The line:

if ( j < HashTable[j]->DocumentsCount-1){

is causing a segmentation fault. I think you want to access HashTable[i]:

if ( j < HashTable[i]->DocumentsCount-1){

In the line:

HashTable[ActualPosition]->Documents[HashTable[ActualPosition]->DocumentsCount];

You were supposed to assign some value. Maybe this:

HashTable[ActualPosition]->Documents[HashTable[ActualPosition]->DocumentsCount] = DocumentIndex;


malloc returns a void pointer. In C it converts implicitly to the target pointer type, so the cast is optional (it is only required if the code is compiled as C++):

HashTable[ActualPosition] = (TTable*)malloc (sizeof(TTable));

You should also initialize the Documents array with a default value when creating a new node in the hash table:

for(j=0;j<5;j++)HashTable[ActualPosition]->Documents[j]=-1;


You are removing everything from your HashTable after finding the first word given by the user. Maybe you wanted to place that code outside the while loop.

Your while(1) loop does not have any terminating condition. You should have one.

All the best !!!

Shasha99
  • Hello! I got it working with the solutions you gave me! Thank you so much for your help. I tried searching for a few words and the results are pretty good: I think that for every document (around 200 words) 85% are correctly indexed. But the program can't find words with Spanish characters ("Ponía"), and some other words that should be found are not being indexed or found correctly. Is there any way I can improve the code? One last question: where do I need to initialize the documents array? Thank you so much! – ODB8 Nov 17 '16 at 21:31
  • @ODB8, if you can provide me the files and your updated code, I can check for the rest of the issues. Please upvote if you find my answer helpful. Also, the document array should be initialized when you find a new word and want to insert it into the hash table for the first time, so in your Index() function, inside the following condition: if (HashTable[ActualPosition] == NULL){...... – Shasha99 Nov 18 '16 at 06:27
  • Hi Shasha, I updated the code in my first message (EDIT 3); it is the latest one I have. I'm working on it right now. Here is the list of files and the project, thanks!!! https://github.com/fsanchez94/re – ODB8 Nov 18 '16 at 07:38
  • @ODB8, the problem I see is that your file also contains punctuation, i.e. ',' '.' etc. For example, consider this line: "i am shashank awasthi." If you try to find the word "awasthi", you won't find it, because the word stored in the dictionary was "awasthi." and not "awasthi". So while inserting a word, you should remove the punctuation character, if present, from the end of the word, and you are through. – Shasha99 Nov 18 '16 at 08:41
  • @ODB8, Have a look here. I have put some debug statements in your code https://code.hackerearth.com/9c26bad – Shasha99 Nov 18 '16 at 09:11
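
One possible way to handle the punctuation issue described in the comments is to treat punctuation as a delimiter while splitting each line. The sketch below also lists '\n' as a delimiter: fgets keeps the newline, and the DELIMITER in the question only contains space and tab, so the last word of every line would otherwise be stored with a trailing newline. The delimiter set and the name SplitWords are assumptions for this example.

```c
#include <assert.h>
#include <string.h>

/* Assumed separator set: whitespace (including the '\n' kept by fgets)
   plus common punctuation. Extend it to match the actual input files. */
#define WORD_DELIMITERS " \t\r\n.,;:!?\"()[]"

/* Split Line in place and store pointers to up to MaxWords words in Words.
   Returns the number of words stored. Line is modified by strtok. */
int SplitWords (char *Line, char *Words[], int MaxWords) {
    assert(Line != NULL && Words != NULL && MaxWords >= 0);
    int Count = 0;
    char *Tok = strtok(Line, WORD_DELIMITERS);
    /* Testing Tok first means a line with no words is handled safely. */
    while (Tok != NULL && Count < MaxWords) {
        Words[Count++] = Tok;
        Tok = strtok(NULL, WORD_DELIMITERS);
    }
    return Count;
}
```

With this delimiter set, the line "i am shashank awasthi." yields the four words "i", "am", "shashank" and "awasthi". Punctuation inside a word, such as an apostrophe, is left untouched.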
0

For a school assignment, you probably don't need to worry about hashing. For a first pass, you can just get away with a straight linear search instead:

  1. Create 3 pointers to char arrays (or a char ** if you prefer), one for each dictionary file.
  2. Scan each text/dictionary file to see how many individual words reside within it. Depending on how the file is formatted, this may be spaces, strings, newlines, commas, etc. Basically, count the words in the file.
  3. Allocate an array of char * times the word count in each file and store it in the char ** for that file (if 100 words are found in the file: num_words = 100; fooPtr = malloc(sizeof(char *) * num_words);).
  4. Go back through the file a second time and allocate an array of chars to the size of each word in the file and store it in the previously created array. You now have a "jagged 2D array" for every word in each dictionary file.

Now, you have 3 arrays for your dictionaries and can use them to scan for words directly.

When given a word, set up a for loop to look through each file's char array. If the entered word matches an entry in the currently scanned dictionary, you have found a match and should print the result. Once you have scanned all dictionaries, you are done.
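
A rough sketch of this approach, with one change that should be stated plainly: instead of scanning each file twice to count the words first, it grows the array with realloc as words arrive. The WordList type and the function names are invented for this example.

```c
#include <assert.h>
#include <stdlib.h>
#include <string.h>

/* Growable list of words for one file (hypothetical helper type). */
typedef struct WordList {
    char **Items;
    size_t Count;
    size_t Capacity;
} WordList;

/* Append a private copy of Word. Returns 0 on success, -1 if out of memory. */
int WordListAdd (WordList *List, const char *Word) {
    assert(List != NULL && Word != NULL);
    if (List->Count == List->Capacity) {
        /* Double the capacity; realloc(NULL, n) behaves like malloc(n). */
        size_t NewCap = List->Capacity ? List->Capacity * 2 : 16;
        char **Grown = realloc(List->Items, NewCap * sizeof *Grown);
        if (Grown == NULL) return -1;
        List->Items = Grown;
        List->Capacity = NewCap;
    }
    char *Copy = malloc(strlen(Word) + 1);
    if (Copy == NULL) return -1;
    strcpy(Copy, Word);
    List->Items[List->Count++] = Copy;
    return 0;
}

/* Linear search: how many times Word occurs in the list. */
size_t WordListCount (const WordList *List, const char *Word) {
    assert(List != NULL && Word != NULL);
    size_t Hits = 0;
    for (size_t i = 0; i < List->Count; i++)
        if (strcmp(List->Items[i], Word) == 0)
            Hits++;
    return Hits;
}
```

Keeping one WordList per file (each initialized to all zeros) and calling WordListCount on each of them answers both "which files contain the word" and "how many times".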

Things to make it faster:

  1. Sort each dictionary, then you can binary search them for matches (O(log n)).
  2. Create a hash table and add each string to it for average O(1) lookup time (this is what most professional solutions would do, which is why you found so many links on this).
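
The first option can be sketched with the standard library's qsort and bsearch. The function names below are made up for illustration, and the word array is assumed to be the char ** built for one file.

```c
#include <assert.h>
#include <stdlib.h>
#include <string.h>

/* qsort and bsearch hand the comparator pointers to array elements,
   which here are pointers to strings. */
static int CompareWords (const void *a, const void *b) {
    const char *const *x = a;
    const char *const *y = b;
    return strcmp(*x, *y);
}

/* Sort the word array in place, in strcmp order. */
void SortWords (char **Words, size_t Count) {
    assert(Words != NULL);
    qsort(Words, Count, sizeof *Words, CompareWords);
}

/* Binary search in an array already sorted by SortWords.
   Returns 1 if Word is present, 0 otherwise. */
int ContainsWord (char **SortedWords, size_t Count, const char *Word) {
    assert(SortedWords != NULL && Word != NULL);
    /* The key is passed the same way as the elements: a pointer to a string pointer. */
    return bsearch(&Word, SortedWords, Count, sizeof *SortedWords, CompareWords) != NULL;
}
```

Sorting costs O(n log n) once per file; each lookup afterwards takes O(log n) string comparisons.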

I've offered almost no code here, just a method. Give it a shot.

One final note: even if you decide to use the hash method or a list or whatever, the code you write with arrays will still be useful.

Michael Dorgan
  • Why 3 dictionaries, you may create a single dictionary and map word -> file indexes. Also most professional solutions would use TRIE or maybe suffix trees if any advanced requirements. – Shasha99 Nov 15 '16 at 20:07
  • Because he wanted to know which file the word came from. And why do it this way at all? Because he is a beginner at C. – Michael Dorgan Nov 15 '16 at 21:27
  • Ya that can also be done using single dictionary i think. – Shasha99 Nov 17 '16 at 09:24