How do I make an alphabetized list of all distinct words in a file with the number of times each word was used?

Question

I am writing a program using Microsoft Visual C++. In the program I must read in a text file and print out an alphabetized list of all distinct words in that file with the number of times each word was used.

I have looked up different ways to alphabetize a string but they do not work with the way I have my string initialized.

// What is inside my text file
Any experienced programmer engaged in writing programs for use by others knows 
that, once his program is working correctly, good output is a must. Few people 
really care how much time and trouble a programmer has spent in designing and 
debugging a program. Most people see only the results. Often, by the time a 
programmer has finished tackling a difficult problem, any output may look 
great. The programmer knows what it means and how to interpret it. However, 
the same cannot be said for others, or even for the programmer himself six 
months hence.

string lines;
getline(input, lines);      // Stores what is in file into the string

I expect an alphabetized list of words with the number of times each word was used. So far, I do not know how to begin this process.

What are your requirements for the exercise? Can you use e.g. [`std::set`](https://en.cppreference.com/w/cpp/container/set), which makes it *very* simple? Or do you have to implement the set structure and sorting yourself? — Some programmer dude, May 28 '19 at 04:40
I am not allowed to use pre-defined functions such as sort() or split(). @Someprogrammerdude — ymneedhelp, May 28 '19 at 05:19
So, to be clear, functions are off the table, but are containers (map, set, queue, that sort of thing)? — , May 28 '19 at 05:56
ah. Well, then. I'd take @RSahu 's suggestion and use a std::map. That will be quite useful to you. — , May 28 '19 at 06:17

Everyone · Answer 1 · 2019-05-28T06:44:05.150

It's rather simple: std::map automatically sorts its entries by key. Here each key/value pair represents a word and its count, which is what you need. You also need to do some filtering for special characters and such.

EDIT: std::stringstream is a convenient way to split a std::string on whitespace, since whitespace is its default delimiter: stream >> word yields whitespace-separated words. That alone is not enough because of punctuation. For example, "Often," carries a comma that we need to filter out. Therefore I used std::replace_if to replace punctuation characters and digits with spaces.

Now a new problem arises. In your example, you have "must.Few", which will be returned as one word. After replacing . with a space we have "must Few". So I'm using another stringstream on the filtered "word" to make sure I have only words in the final result.

In the second loop you will notice if(word == "") continue;, this can happen if the string is not trimmed. If you look at the code you will find out that we aren't trimming after replacing puncts and digits. That is, "Often," will be "Often " with trailing whitespace. The trailing whitespace causes the second loop to extract an empty word. This is why I added the condition to ignore it. You can trim the filtered result and then you wouldn't need this check.
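
As a minimal sketch of an alternative to the empty-word check: if the extraction itself is used as the loop condition, the body only runs when a word was actually read. The helper name split_words is hypothetical and is not part of the program below.

```cpp
#include <sstream>
#include <string>
#include <vector>

// Hypothetical helper (not part of the answer's program): splits text on
// whitespace. Because `stream >> word` is the loop condition, the body runs
// only after a successful read, so trailing whitespace never yields an
// empty string.
std::vector<std::string> split_words(const std::string& text) {
  std::vector<std::string> words;
  std::istringstream stream(text);
  std::string word;
  while (stream >> word)
    words.push_back(word);
  return words;
}
```

With this loop shape, split_words("Often ") returns a single element and no empty strings.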

Finally, I have added ignorecase boolean to check if you wish to ignore the case of the word or not. If you wish to do so, the program will simply convert the word to lowercase and then add it to the map. Otherwise, it will add the word the same way it found it. By default, ignorecase = true, if you wish to consider case, just call the function differently: count_words(input, false);.

Edit 2: In case you're wondering, the statement counts[word] will automatically create key/value pair in the std::map IF there isn't any key matching word. So when we call ++: if the word isn't in the map, it will create the pair, and increment value by 1 so you will have newly added word. If it exists already in the map, this will increment the existing value by 1 and hence it acts as a counter.
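
To illustrate that behaviour in isolation, here is a small sketch. The function name count_demo and its fixed word list are assumptions made up for this example.

```cpp
#include <cstddef>
#include <map>
#include <string>

// Hypothetical demo: counts the words in a small fixed list. operator[]
// inserts a missing key with a value-initialized count of 0, so the same
// ++ statement handles both new and already seen words.
std::map<std::string, std::size_t> count_demo() {
  std::map<std::string, std::size_t> counts;
  const char* words[] = {"time", "a", "time"};
  for (const char* w : words)
    ++counts[w];  // first sight: inserts {w, 0}, then increments to 1
  return counts;
}
```

The resulting map holds two keys, "a" with count 1 and "time" with count 2, and iterating it visits "a" first because std::map keeps its keys ordered.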

The program:

#include <iostream> 
#include <map>
#include <sstream>
#include <cstring>
#include <cctype>
#include <string>
#include <iomanip>
#include <algorithm>

std::string to_lower(const std::string& str) {
  std::string ret; 
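  for (char c : str)
    // cast first: passing a negative char to tolower is undefined behavior
    ret.push_back(static_cast<char>(std::tolower(static_cast<unsigned char>(c))));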
  return ret;
}

std::map<std::string, size_t> count_words(const std::string& str, bool ignorecase = true) {
  std::map<std::string, size_t> counts;

  std::stringstream stream(str);
  while (stream.good()) {
    // wordW may have multiple words connected by special chars/digits
    std::string wordW;
    stream >> wordW;
    // filter special chars and digits (unsigned char keeps the <cctype> calls well-defined)
    std::replace_if(wordW.begin(), wordW.end(),
      [](unsigned char c) { return std::ispunct(c) || std::isdigit(c); }, ' ');

    // now wordW may have multiple words separated by whitespaces, extract them
    std::stringstream word_stream(wordW);
    while (word_stream.good()) {
      std::string word;
      word_stream >> word;
      // ignore empty words
      if (word == "") continue;
      // add to count. 
      ignorecase ? counts[to_lower(word)]++ : counts[word]++;
    }
  }
  return counts; 
}

void print_counts(const std::map<std::string, size_t>& counts) {
  for (const auto& pair : counts)  // iterate by reference to avoid copying each entry
    std::cout << std::setw(15) << pair.first << " : " << pair.second << std::endl;
}

int main() {
  std::string input = "Any experienced programmer engaged in writing programs for use by others knows \
    that, once his program is working correctly, good output is a must.Few people \
    really care how much time and trouble a programmer has spent in designing and \
    debugging a program.Most people see only the results.Often, by the time a \
    programmer has finished tackling a difficult problem, any output may look \
    great.The programmer knows what it means and how to interpret it.However, \
    the same cannot be said for others, or even for the programmer himself six \
    months hence.";

  auto counts = count_words(input); 
  print_counts(counts);
  return 0;
}

I have tested this with Visual Studio 2017, and here is part of the output:

          a : 5
        and : 3
        any : 2
         be : 1
         by : 2
     cannot : 1
       care : 1
  correctly : 1
  debugging : 1
  designing : 1

score 1 · Answer 2 · answered May 28 '19 at 07:17

As others have already noted, a std::map handles the counting you care about quite easily.

Iostreams already have a tokenizer to break an input stream up into words. In this case, though, we want to "think" of only letters as characters that can make up words. A stream uses a locale to make that sort of decision, so to change how it's done, we need to define a locale that classifies characters as we see fit.

#include <algorithm>
#include <locale>
#include <vector>

struct alpha_only: std::ctype<char> {
    alpha_only(): std::ctype<char>(get_table()) {}

    static std::ctype_base::mask const* get_table() {
        // everything is white space
        static std::vector<std::ctype_base::mask> 
            rc(std::ctype<char>::table_size,std::ctype_base::space);

        // except lower- and upper-case letters, which are classified accordingly
        // (std::fill takes a half-open range, hence the + 1 to include 'z' and 'Z'):
        std::fill(&rc['a'], &rc['z'] + 1, std::ctype_base::lower);
        std::fill(&rc['A'], &rc['Z'] + 1, std::ctype_base::upper);
        return &rc[0];
    }
};

With that in place, we tell the stream to use our ctype facet, then simply read words from the file and count them in the map:

std::cin.imbue(std::locale(std::locale(), new alpha_only));
std::map<std::string, std::size_t> counts;

std::string word;
while (std::cin >> word)
    ++counts[to_lower(word)];

...and when we're done with that, we can print out the results:

for (const auto& w : counts)
    std::cout << w.first << ": " << w.second << "\n";
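
For reference, the pieces above can be combined into one self-contained sketch. It reads from a std::istringstream rather than std::cin so that it can be tried on a string. The names letters_only and count_letters_only are hypothetical, and the lowercasing is done inline instead of through a to_lower helper.

```cpp
#include <algorithm>
#include <cctype>
#include <cstddef>
#include <locale>
#include <map>
#include <sstream>
#include <string>
#include <vector>

// Facet that classifies every char as whitespace except ASCII letters.
struct letters_only : std::ctype<char> {
    letters_only() : std::ctype<char>(get_table()) {}

    static const std::ctype_base::mask* get_table() {
        static std::vector<std::ctype_base::mask>
            table(std::ctype<char>::table_size, std::ctype_base::space);
        // std::fill takes a half-open range, so + 1 is needed to include 'z'/'Z'.
        std::fill(&table['a'], &table['z'] + 1, std::ctype_base::lower);
        std::fill(&table['A'], &table['Z'] + 1, std::ctype_base::upper);
        return table.data();
    }
};

// Hypothetical wrapper: counts case-insensitive words made of letters only.
std::map<std::string, std::size_t> count_letters_only(const std::string& text) {
    std::istringstream in(text);
    // The locale takes ownership of the facet and deletes it when done.
    in.imbue(std::locale(std::locale(), new letters_only));

    std::map<std::string, std::size_t> counts;
    std::string word;
    while (in >> word) {
        for (char& c : word)
            c = static_cast<char>(std::tolower(static_cast<unsigned char>(c)));
        ++counts[word];
    }
    return counts;
}
```

Because punctuation and digits are treated as whitespace by the facet, an input such as "must.Few" is read as the two words "must" and "few" without any separate filtering step.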

score 0 · Answer 3 · answered May 28 '19 at 04:38

I'd probably start by inserting all of the words into an array of strings. Then take the first element and compare it with every other element; each time you find a match, add 1 to a counter. After you have gone through the array, display the word you were searching for and how many matches there were, then move on to the next element and repeat. Alternatively, keep a parallel array of integers that holds the number of matches for each word; that way you can do all the comparisons in one pass and all the displays in another.
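
A rough sketch of the parallel-array idea follows. It goes one step beyond the description above by inserting each new word at its alphabetical position, so that no call to sort() is needed; that part is an assumption about what the exercise allows. WordTable and add_word are hypothetical names, and std::vector stands in for raw arrays.

```cpp
#include <cstddef>
#include <string>
#include <vector>

// Hypothetical container for the two parallel arrays:
// counts[i] holds the number of occurrences of words[i].
struct WordTable {
    std::vector<std::string> words;
    std::vector<std::size_t> counts;
};

// Records one occurrence of `word`. New words are inserted at their
// alphabetical position, so the table stays sorted without a sort step.
void add_word(WordTable& table, const std::string& word) {
    std::size_t i = 0;
    while (i < table.words.size() && table.words[i] < word)
        ++i;  // linear scan for the first entry not less than `word`

    if (i < table.words.size() && table.words[i] == word) {
        ++table.counts[i];  // seen before: bump the matching counter
    } else {
        table.words.insert(table.words.begin() + i, word);
        table.counts.insert(table.counts.begin() + i, 1);
    }
}
```

Calling add_word once per word read from the file leaves words in alphabetical order with the matching totals in counts, so printing is a single loop over both arrays.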

Dylan Gentile · Answer 4 · 2019-05-28T06:25:17.133

EDIT:

Everyone's answer seems more elegant because of the map's inherent sorting. My answer functions more as a parser, that later sorts the tokens. Therefore my answer is only useful to the extent of a tokenizer or lexer, whereas Everyone's answer is only good for sorted data.

You first probably want to read in the text file. You want to use a streambuf iterator to read in the file (found here). You will now have a string called content, which is the content of your file. Next you will want to iterate, or loop, over the contents of this string. To do that you'll want to use an iterator. There should be a string outside of the loop that stores the current word. You will iterate over the content string, and each time you hit a letter character, you will add that character to your current word string. Then, once you hit a space character, you will take that current word string and push it back into the wordStringV vector. (Note: this means that non-letter characters are ignored, and that only spaces denote word separation.)

Now that we have a vector of all of our words as strings, we can use std::sort to sort the vector in alphabetical order. (Note: capitalized words take precedence over lowercase words, and therefore will be sorted first.) Then we will iterate over the wordStringV vector and convert its strings into Word objects (this is a little heavyweight) that store the word string and its number of appearances. We will push these Word objects into a Word vector, but if we discover a repeated word string, instead of adding it to the Word vector, we'll grab the previous entry and increment its appearance count.
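
As a minimal sketch of the "grab the previous entry" step, assuming the input vector is already sorted: equal words are then adjacent, so each word only has to be compared with the last entry added. The helper name count_sorted is hypothetical, and it uses std::pair in place of the Word class from the full code.

```cpp
#include <cstddef>
#include <string>
#include <utility>
#include <vector>

// Hypothetical helper: collapses an already sorted list of words into
// (word, count) pairs by comparing each word with the last pair emitted.
std::vector<std::pair<std::string, std::size_t>>
count_sorted(const std::vector<std::string>& sortedWords) {
    std::vector<std::pair<std::string, std::size_t>> result;
    for (const std::string& w : sortedWords) {
        if (!result.empty() && result.back().first == w)
            ++result.back().second;  // repeat of the previous word
        else
            result.push_back(std::make_pair(w, std::size_t(1)));  // new word
    }
    return result;
}
```

This makes a single pass over the sorted words, whereas searching the whole output vector for every word does more comparisons than the sorted order requires.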

Finally, once this is all done, we can iterate over our Word object vector and output the word followed by its appearances.

Full Code:

#include <vector>
#include <fstream>
#include <iostream>
#include <streambuf>
#include <algorithm>
#include <string>
#include <functional> //std::less

class Word //define word object
{
public:
    Word(){appearances = 1;}
    ~Word(){}
    int appearances;
    std::string mWord;
};

bool isLetter(const char x)
{
    return((x >= 'a' && x <= 'z') || (x >= 'A' && x <= 'Z'));
}

int main()
{
    std::string srcFile = "myTextFile.txt"; //what file are we reading
    std::ifstream ifs(srcFile);
    std::string content( (std::istreambuf_iterator<char>(ifs) ),
                       (  std::istreambuf_iterator<char>()    )); //read in the file
    std::vector<std::string> wordStringV; //create a vector of word strings
    std::string current = ""; //define our current word
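    for(auto it = content.begin(); it != content.end(); ++it) //iterate over our input
    {
        const char currentChar = *it; //make life easier
        if(currentChar == ' ')
        {
            if(!current.empty()) //skip empty tokens caused by repeated spaces
                wordStringV.push_back(current);
            current = "";
        }
        else if(isLetter(currentChar))
        {
            current += currentChar;
        }
    }
    if(!current.empty()) //don't lose the last word if the file doesn't end with a space
        wordStringV.push_back(current);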

    std::sort(wordStringV.begin(), wordStringV.end(), std::less<std::string>());
    std::vector<Word> wordVector;

    for(auto it = wordStringV.begin(); it != wordStringV.end(); ++it) //iterate over wordString vector
    {
        std::vector<Word>::iterator wordIt;
        //see if the current word string has appeared before...
        for(wordIt = wordVector.begin(); wordIt != wordVector.end(); ++wordIt) 
        {
            if((*wordIt).mWord == *it)
                break;
        }
        if(wordIt == wordVector.end()) //...if not create a new Word obj
        {
            Word theWord;
            theWord.mWord = *it;
            wordVector.push_back(theWord);
        }
        else //...otherwise increment the appearances.
        {
            ++((*wordIt).appearances);
        }
    }
    //print the words out
    for(auto it = wordVector.begin(); it != wordVector.end(); ++it)
    {
        Word theWord = *it;
        std::cout << theWord.mWord << " " << theWord.appearances << "\n";
    }

    return 0;
}

Side Notes

Compiled with g++ version 4.2.1 with target x86_64-apple-darwin, using the compiler flag -std=c++11.

If you don't like iterators you can instead do

for(std::size_t i = 0; i < content.size(); ++i)
{
    char currentChar = content[i];
}

It's important to note that if you don't care about capitalization, you can simply apply std::tolower (from <cctype>) in the current += *it; statement (i.e. current += static_cast<char>(std::tolower(*it));).
Also, you seem like a beginner and this answer might have been too heavyweight, but you're asking for a basic parser and that is no easy task. I recommend starting by parsing simpler strings like math equations. Maybe make a calculator app.
