I'm still working on the problem mentioned in this post: Sorting vector of strings with leading numbers
The original problem is as follows:
Write a complete C++ program that outputs the k most frequently used words in file input.txt, one per line in descending order of frequency, where k is a nonnegative integer read from input. Ties are broken arbitrarily, and if there are only u different words in input.txt and u < k, then the output has only u entries. For this problem, you may not use any STL class or algorithm except vector and string. A word is a maximal block of non-white-space characters with punctuation removed. Each output line consists of a word followed by its frequency count. (Inputs and k-values are given.)
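One way to read the word definition above, as a minimal sketch: each whitespace-separated token is kept, and its punctuation characters are dropped. The name `strip_punctuation` is hypothetical and is not the `remove_punc` helper used in the code further down; the use of `std::ispunct` to decide what counts as punctuation is also an assumption.

```cpp
#include <cctype>
#include <cstddef>
#include <string>

// Hypothetical helper (not from the original program): returns a copy of
// `token` with every character that std::ispunct classifies as punctuation
// removed. Assumes the default "C" locale.
std::string strip_punctuation(const std::string& token)
{
    std::string cleaned;
    for (std::size_t i = 0; i < token.size(); ++i) {
        // Cast to unsigned char first: passing a negative char value to
        // std::ispunct is undefined behavior.
        unsigned char c = static_cast<unsigned char>(token[i]);
        if (!std::ispunct(c)) {
            cleaned += token[i];
        }
    }
    return cleaned;
}
```

A token made up only of punctuation comes back as an empty string, so the caller has to decide whether to count it as a word.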
Thanks to those who suggested using a struct, I ended up with a slightly more efficient solution and less code.
However, for relatively large text files (more than 400,000 words), my program keeps running for more than 5 minutes without producing any output. It works fine on small input files. I'm not sure whether the file is simply too big, or whether there is a problem in the algorithm itself that causes memory overflow or corruption.
Here's my code for the program:
struct word_freq {
    int freq;
    string word;
};

bool operator<(const word_freq& a, const word_freq& b) {
    return a.freq < b.freq;
}

void word_frequencies(ifstream& inf, int k)
{
    vector<string> input;
    string w;
    while (inf >> w)
    {
        remove_punc(w);
        input.push_back(w);
    }
    sort(input.begin(), input.end());

    // initialize frequency vector
    vector<int> freq;
    for (size_t i = 0; i < input.size(); ++i) freq.push_back(1);

    // count actual frequencies
    int count = 0;
    for (size_t i = 0; i < input.size()-1; ++i)
    {
        if (input[i] == input[i+1])
        {
            ++count;
        } else
        {
            freq[i] += count;
            count = 0;
        }
    }

    // words+frequencies
    vector<word_freq> wf;
    for (int i = 0; i < freq.size(); ++i)
    {
        if (freq[i] > 1 || is_unique(input, input[i]))
        {
            word_freq st = {freq[i], input[i]};
            wf.push_back(st);
        }
    }

    // printing
    sort(wf.begin(), wf.end());
    if (wf.size() < k)
    {
        for (int i = wf.size()-1; i >= 0; --i)
        {
            cout << wf[i].word << " " << wf[i].freq << endl;
        }
    } else
    {
        for (int i = wf.size()-1; i >= wf.size()-1-k; --i)
        {
            cout << wf[i].word << " " << wf[i].freq << endl;
        }
    }
}
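For comparison, the counting and top-k selection steps could perhaps be done in a single pass over the sorted words. This is only a sketch under stated assumptions. The names `count_runs` and `top_k` are hypothetical and not part of the code above. `count_runs` assumes its input is already sorted, and `top_k` assumes its input is sorted in ascending order of `freq`. All indices are `std::size_t`, so no signed value is ever compared against an unsigned one.

```cpp
#include <cstddef>
#include <string>
#include <vector>

struct word_freq {
    int freq;
    std::string word;
};

// Hypothetical helper: collapses a sorted vector of words into one
// (count, word) entry per distinct word, in a single pass.
// Assumes equal words are adjacent, i.e. `sorted_words` is already sorted.
std::vector<word_freq> count_runs(const std::vector<std::string>& sorted_words)
{
    std::vector<word_freq> wf;
    std::size_t i = 0;
    while (i < sorted_words.size()) {
        // Advance j to the end of the run of words equal to sorted_words[i].
        std::size_t j = i;
        while (j < sorted_words.size() && sorted_words[j] == sorted_words[i]) {
            ++j;
        }
        word_freq entry = {static_cast<int>(j - i), sorted_words[i]};
        wf.push_back(entry);
        i = j;  // continue with the next distinct word
    }
    return wf;
}

// Hypothetical helper: returns up to k entries taken from the back of
// `sorted_wf`, highest frequency first.
// Assumes `sorted_wf` is sorted in ascending order of freq.
std::vector<word_freq> top_k(const std::vector<word_freq>& sorted_wf, int k)
{
    std::size_t limit = (k < 0) ? 0 : static_cast<std::size_t>(k);
    if (limit > sorted_wf.size()) {
        limit = sorted_wf.size();
    }
    std::vector<word_freq> result;
    for (std::size_t n = 0; n < limit; ++n) {
        // n < limit <= size, so size - 1 - n never wraps around.
        result.push_back(sorted_wf[sorted_wf.size() - 1 - n]);
    }
    return result;
}
```

Printing would then be a plain loop over the returned vector, writing `word` followed by `freq` on each line.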
If anyone can point out my mistakes, I would greatly appreciate it.