2

I am working on a Comp Sci assignment. In the end, the program will determine whether a file is written in English or French. Right now, I'm struggling with the method that counts the frequency of words that appears in a .txt file.

I have a set of text files in both English and French in their respective folders labeled 1-20. The method asks for a directory (which in this case is "docs/train/eng/" or "docs/train/fre/") and for how many files that the program should go through (there are 20 files in each folder). Then it reads that file, splits all the words apart (I don't need to worry about capitalization or punctuation), and puts every word in a HashMap along with how many times they were in the file. (Key = word, Value = frequency).

This is the code I came up with for the method:

public static HashMap<String, Integer> countWords(String directory, int nFiles) {
// Declare the HashMap
HashMap<String, Integer> wordCount = new HashMap();

// this large 'for' loop will go through each file in the specified directory.
for (int k = 1; k < nFiles; k++) {
  // Puts together the string that the FileReader will refer to.
  String learn = directory + k + ".txt";

try {
  FileReader reader = new FileReader(learn);
  BufferedReader br = new BufferedReader(reader);
  // The BufferedReader reads the lines

  String line = br.readLine();


  // Split the line into a String array to loop through
  String[] words = line.split(" ");
  int freq = 0;

  // for loop goes through every word
  for (int i = 0; i < words.length; i++) {
    // Case if the HashMap already contains the key.
    // If so, just increments the value

    if (wordCount.containsKey(words[i])) {         
      wordCount.put(words[i], freq++);
    }
    // Otherwise, puts the word into the HashMap
    else {
      wordCount.put(words[i], freq++);
    }
  }
  // Catching the file not found error
  // and any other errors
}
catch (FileNotFoundException fnfe) {
  System.err.println("File not found.");
}
catch (Exception e) {
  System.err.print(e);
   }
 }
return wordCount;
}

The code compiles. Unfortunately, when I asked it to print the results of all the word counts for the 20 files, it printed this. It's complete gibberish (though the words are definitely there) and is not at all what I need the method to do.

If anyone could help me debug my code, I would greatly appreciate it. I've been at it for ages, conducting test after test and I'm ready to give up.

Alex K
  • 7,771
  • 9
  • 35
  • 54
Kommander Kitten
  • 253
  • 1
  • 7
  • 15
  • You should split your code up into different methods. e.g. one method could be `static HashMap frequency(List strings) {...}` – SpiderPig Apr 08 '15 at 23:05

3 Answers3

3

I would have expected something more like this. Does it make sense?

if (wordCount.containsKey(words[i])) { 
  int n = wordCount.get(words[i]);    
  wordCount.put(words[i], ++n);
}
// Otherwise, puts the word into the HashMap
else {
  wordCount.put(words[i], 1);
}

If the word is already in the hashmap, we want to get the current count, add 1 to that and replace the word with the new count in the hashmap.

If the word is not yet in the hashmap, we simply put it in the map with a count of 1 to start with. The next time we see the same word we'll up the count to 2, etc.

jas
  • 10,182
  • 2
  • 29
  • 39
3

Let me combine all the good answers here.

1) Split up your methods to handle one thing each. One to read the files into strings[], one to process the strings[], and one to call the first two.

2) When you split think deeply about how you want to split. As @m0skit0 suggest you should likely split with \b for this problem.

3) As @jas suggested you should first check if your map already has the word. If it does increment the count, if not add the word to the map and set it's count to 1.

4) To print out the map in the way you likely expect, take a look at the below:

Map test = new HashMap();

for (Map.Entry entry : test.entrySet()){
  System.out.println(entry.getKey() + " " + entry.getValue());
}
Michael Hobbs
  • 1,445
  • 1
  • 12
  • 23
  • For printing out the map in a "Key : Value" format, where would I put this? In my main method? In my countWords Method? Also, this should only be done in one method. I figured I would only split by space because that was all the assignment called for. – Kommander Kitten Apr 09 '15 at 00:52
  • @KommanderKitten you could drop it in main or put it in a view function. If you are familiar with MVC. If you not familiar with MVC it is a pattern of how to break code up into logical units. http://stackoverflow.com/questions/2056/what-are-mvp-and-mvc-and-what-is-the-difference – Michael Hobbs Apr 09 '15 at 08:02
  • Nice copy-paste from other answers :) – m0skit0 Apr 09 '15 at 08:14
  • I'm still not sure how to integrate this with my code. I just want to print out: Cat : 1, the : 4, and : 2, etc... – Kommander Kitten Apr 09 '15 at 18:56
  • @KommanderKitten You have 3 valid answers on your problem. Make an effort to integrate it or ask specifically what you don't understand. Do not expect copy-paste code, for your own good, because copy-pasting won't teach you a thing. Good luck. – m0skit0 Apr 09 '15 at 20:52
2

If you split by space only, then other signs (parenthesis, punctuation marks, etc...) will be included in the words. For example: "This phrase, contains... funny stuff", if you split it by space you get: "This" "phrase," "contains..." "funny" and "stuff".

You can avoid this by splitting by word boundary (\b) instead.

line.split("\\b");

Btw your if and else parts are identical. You're always incrementing freq by one, which doesn't make much sense. If the word is already in the map, you want to get the current frequency, add 1 to it, and update the frequency in the map. If not, you put it in the map with a value of 1.

And pro tip: always print/log the full stacktrace for the exceptions.

m0skit0
  • 23,052
  • 11
  • 71
  • 113
  • What do you mean "full stracktrace"? I'm new to coding and I'm not "hip with the lingo". – Kommander Kitten Apr 08 '15 at 22:55
  • http://fr33kk0mpu73r.blogspot.com.es/2013/11/reading-java-exception-stacktraces.html – m0skit0 Apr 08 '15 at 22:57
  • I would use `line.split("[^a-zA-Z]+");` and then call `toLowerCase()`on all words. – SpiderPig Apr 08 '15 at 23:04
  • @m0skit0 he is not getting an error, just not the results he was expecting. – Michael Hobbs Apr 08 '15 at 23:06
  • @SpiderPig Why complicate? And that would only work for English. – m0skit0 Apr 08 '15 at 23:07
  • @MichaelHobbs I know. Still his exception catching is bad. That was a bonus tip, nothing to do with the question. – m0skit0 Apr 08 '15 at 23:10
  • @m0skit0 Ah, I see what you are talking about now, agreed. On a side though I don't always print out the exceptions, but I do log them. I dont feel you should always print them. In an RSS project I did for a site I found the vast majority of the feeds had xml errors which I suppressed from printing but did log. – Michael Hobbs Apr 08 '15 at 23:22
  • @MichaelHobbs That's even better of course, I also log them. – m0skit0 Apr 08 '15 at 23:25
  • @KommanderKitten Anyway that's not part of your problem, just a tip as I said – m0skit0 Apr 08 '15 at 23:26
  • 1
    Thank you for the tip. My professor says that we needn't worry about punctuation and lower/upper case. The program I'm writing is designed to figure out if the given .txt file is written in English or French, and there are certainly enough simple lower case words to deduce that. – Kommander Kitten Apr 09 '15 at 00:44
  • @KommanderKitten That's fine, but if you can do it right in a simple way, it's better, isn't it? Because if for example a word has a "." after it, it will be counted as a different word. Probably your professor doesn't know about `\b` (not much people actually know about it in fact). I've also expanded my answer about why the frequency is messed up. – m0skit0 Apr 09 '15 at 08:13
  • 1
    @SpiderPig YOu are almost right. But it won't work if there is any Aphostropy character. for Eg : Can't - It will divide it into 2 can, t. – Rakesh Gondaliya Nov 02 '15 at 22:12
  • @RakeshGondaliya Thanks, I didn't know this. In fact it divides it in 3: `can`, `'`, `t`. Weird. – m0skit0 Nov 03 '15 at 10:40