I'm still working on the problem mentioned in this post: Sorting vector of strings with leading numbers

The original problem is as follows:

Write a complete C++ program that outputs the k most frequently used words in file input.txt, one per line in descending order of frequency, where k is a nonnegative integer read from input. Ties are broken arbitrarily, and if there are only u different words in input.txt and u < k, then the output has only u entries. For this problem, you may not use any STL class or algorithm except vector and string. A word is a maximal block of non-whitespace characters with punctuation removed. Each output line consists of a word followed by its frequency count. (Inputs and k-values are given.)

Thanks to those who suggested using a struct; I ended up with a slightly more efficient solution and less code.

However, the problem is that for relatively large text files (consisting of >400,000 words), my program can keep running for more than 5 minutes without producing any result. The program runs perfectly on small input files. I'm not sure whether the file is simply too big, or whether there's a problem with the algorithm itself that causes memory overflow/corruption.

Here's my code for the program:

struct word_freq {
    int freq;
    string word;
};

bool operator<(const word_freq& a, const word_freq& b) {
    return a.freq < b.freq;
}

void word_frequencies(ifstream& inf, int k)
{
    vector<string> input;
    string w;
    while (inf >> w)
    {
        remove_punc(w);
        input.push_back(w);
    }
    sort(input.begin(), input.end());

    // initialize frequency vector
    vector<int> freq;
    for (size_t i = 0; i < input.size(); ++i) freq.push_back(1);

    // count actual frequencies
    int count = 0;
    for (size_t i = 0; i < input.size()-1; ++i)
    {
        if (input[i] == input[i+1])
        {
            ++count;
        } else
        {
            freq[i] += count;
            count = 0;
        }
    }

    // words+frequencies
    vector<word_freq> wf;
    for (int i = 0; i < freq.size(); ++i)
    {
        if (freq[i] > 1 || is_unique(input, input[i]))
        {
            word_freq st = {freq[i], input[i]};
            wf.push_back(st);
        }
    }

    // printing
    sort(wf.begin(), wf.end());
    if (wf.size() < k)
    {
        for (int i = wf.size()-1; i >= 0; --i)
        {
            cout << wf[i].word << " " << wf[i].freq << endl;
        }
    } else
    {
        for (int i = wf.size()-1; i >= wf.size()-1-k; --i)
        {
            cout << wf[i].word << " " << wf[i].freq << endl;
        }
    }
}

If anyone can point out the mistakes I've made, it would be greatly appreciated.

Jiabei Luo
  • A debugger will tell you exactly what's going on. Let your application run for a bit then pause it to see what it's currently doing. – François Andrieux May 11 '17 at 18:37
  • Did you try stepping through your code with a debugger? Or attaching to your program with a debugger when you think it has halted, to investigate? – Algirdas Preidžius May 11 '17 at 18:38
  • You say you cannot use algos or containers except `std::vector` or `std::string`, but you are using `std::sort()`. What else are you not telling us? – Slava May 11 '17 at 18:39
  • valgrind can also be useful if it is getting stuck – didiz May 11 '17 at 19:00
  • If you want people to help you, you'd better help people first. Your code is difficult to read. – Slava May 11 '17 at 19:09
  • `remove_punc` - your code is missing the definition of this function. – Snps May 11 '17 at 19:11
  • If you can't use the standard library (except streams I hope), you supposedly either have to implement your own sort function, or a custom hash map to solve this (unless you wanna go `O(n³+)`). Can't say which one is easier (probably sort). – Snps May 11 '17 at 19:20
  • If the debugger initially shows nothing and you don't feel like stepping through a million cycles reading your big data, then you could simply make your program output its progress. _E.g._, accumulate the number of characters processed in each iteration, divide by the total file size in bytes (determined at start), and output that as a percentage of progress. You could also limit this output to every 100th iteration or so using, _e.g._, `if (count % 100 == 0)`. This way you can get a feel for how long it's going to take and whether it seems reasonable. – Snps May 11 '17 at 19:36 (a minimal sketch of this idea follows these comments)
  • I think @Slava is onto something about sort(). Your statement `sort(input.begin(), input.end())` may be the problem, operating across vector memory allocations where the sort results in rearrangement of elements not fitting a given allocation. – Dr t May 11 '17 at 21:37
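
A minimal sketch of the progress-reporting idea from the comment above. The file name, the use of std::cerr, and the reporting interval are illustrative assumptions, not part of the original program:

#include <fstream>
#include <iostream>
#include <string>

int main()
{
    std::ifstream inf("input.txt", std::ios::ate); // open positioned at the end to learn the file size
    const double total = static_cast<double>(inf.tellg());
    inf.seekg(0); // rewind and read normally

    std::string w;
    long count = 0;
    while (inf >> w)
    {
        // ... process the word here (remove_punc, counting, etc.) ...
        if (++count % 100 == 0) // report only every 100th word
        {
            double done = static_cast<double>(inf.tellg()) / total;
            std::cerr << "progress: " << done * 100.0 << "%\r";
        }
    }
}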

2 Answers


If you call reserve() on your vectors up front, performance will be much better.

Pushing back to vectors constantly causes memory fragmentation.

The reason is that the vectors constantly outgrow their allocated capacity and get reallocated often. Frequent reallocation of small objects is expensive and has a direct impact on performance.

Calling reserve with a large enough chunk of memory initially, and calling it again when the vector's size matches its capacity, helps avoid this issue.
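
Applied to the program in the question, it could look like the following sketch (the 500000 figure is an assumed estimate of the word count based on the stated input size, not a measured value):

#include <fstream>
#include <string>
#include <vector>

int main()
{
    std::ifstream inf("input.txt");

    std::vector<std::string> input;
    input.reserve(500000); // assumed estimate: one up-front allocation instead of many regrowths

    std::string w;
    while (inf >> w)
        input.push_back(w); // no reallocation until the reserved capacity is exceeded
}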

More here:

What is memory fragmentation?

And here:

Should I worry about memory fragmentation with std::vector?

Small demonstration with performance measurements:

#include <chrono>
#include <ctime>
#include <iostream>
#include <string>
#include <vector>

int main()
{
        std::vector<std::string> slow;
        std::string d = "divide and conquer";

        std::chrono::time_point<std::chrono::system_clock> start, end;
        start = std::chrono::system_clock::now();

        // I get reallocated all the time
        for ( int i=0; i < 100000; i++ )
        {
            slow.push_back(d);
        }

        end = std::chrono::system_clock::now();

        std::chrono::duration<double> elapsed_seconds = end-start;
        std::time_t end_time = std::chrono::system_clock::to_time_t(end);

        std::cout << "elapsed time v1: " << elapsed_seconds.count() << "s\n";

        start = std::chrono::system_clock::now();

        //I don't move around
        slow.reserve(100000);
        slow.clear();
        for ( int i=0; i < 100000; i++ )
        {
            slow.push_back(d);
        }

        end = std::chrono::system_clock::now();

        elapsed_seconds = end-start;
        end_time = std::chrono::system_clock::to_time_t(end);

        std::cout << "elapsed time v2: " << elapsed_seconds.count() << "s\n";
        return 0;
}

Output:

    elapsed time v1: 0.014085s

    elapsed time v2: 0.004597s
didiz
  • I didn't down vote, nor do I disagree with calling reserve if the number of elements is known up front, but how in the world does a vector that guarantees contiguous memory cause memory fragmentation? – Christopher Pisz May 11 '17 at 19:33
  • @ChristopherPisz: Click on the link provided and find out. – Lightness Races in Orbit May 11 '17 at 19:38
  • @BoundryImposition I did and it contained absolutely nothing in the way of explanation to back up such a claim. I wouldn't make the comment without first reading the page provided by the link. Contiguous memory and fragmentation at the same time, doesn't exactly make sense. Can you also be light while being dark and heavy while being light? You cannot be fragmented and contiguous. Perhaps they are referring to memory other than that used by the vector, but again, zero explanation provided. So, who knows what they are talking about. – Christopher Pisz May 11 '17 at 19:44
  • @ChristopherPisz: The fragmentation doesn't occur because the storage is contiguous, it occurs because you're move-expanding it all the time with frequent push-backs and no reserve (in theory). Piotr's answer there links further to a good explanation of how memory fragmentation occurs. – Lightness Races in Orbit May 11 '17 at 19:54
  • @ChristopherPisz Simple example: let's say you have a total memory of 10 bytes. You are going to push 8 bytes one by one into a vector. If you start by using reserve, the vector allocates 8 bytes and everything works out fine. If you do _not_ reserve, the vector starts out allocating, let's say, 6 bytes. Bytes #1 to 6 push fine, but byte #7 doesn't fit. The vector needs to allocate another chunk that can hold the old data plus room for the new, _i.e._, 6+ bytes, so it can copy the old data to the new chunk before releasing the old chunk of memory. However, there are only 4 bytes left of total memory. – Snps May 11 '17 at 20:00
  • @snps that really has nothing to do with vector at all; it's going to happen on all allocations anytime, anywhere. It would be clearer to say "In order to keep allocations to a minimum for all STL containers, use reserve if possible" than to specifically call out vector. – Christopher Pisz May 11 '17 at 20:05
  • Sorry for not explaining more; I always prefer to add a link that explains it better than I can. Anyway, I wrote a small program demonstrating what I mean for the OP to do in order to improve performance. At 400K words one does have to think about fragmentation, and it might be his only issue. – didiz May 11 '17 at 20:07
  • @ChristopherPisz Of course other types of allocations "anywhere" can cause fragmentation as well, but the argument here was whether an expanding `std::vector` causes fragmentation. You also seem to claim that contiguous memory allocations can't cause fragmentation. This is just plain wrong. Maybe you misunderstand: it's not the contiguous chunk of memory itself that gets fragmented, it's the total memory pool. – Snps May 11 '17 at 20:11
  • @ChristopherPisz And just to be clear, I'm not saying that memory fragmentation is gonna be a big issue on this particular problem, generally. I just kind of joined in on this argument about this particular topic. – Snps May 11 '17 at 20:21
  • @snps Right on! Thanks for the clarity. – Christopher Pisz May 11 '17 at 20:22
  • I doubt fragmentation is the issue, let alone the only issue. The main problem: the OP should not read all words into a vector and sort them, but should keep the vector sorted without dups and increase a counter. But it is difficult to say whether the OP can use `std::lower_bound` or `std::equal_range` or has to implement that. – Slava May 12 '17 at 01:38

You make your program do way too much work, both in memory and in computation. First you read all the words into memory and sort them. Then you calculate frequencies and populate yet another vector. You should have a std::vector<word_freq> in the first place, keep it sorted by word (by inserting elements into the proper place), and either insert a new element or increase the counter on an existing one. Then re-sort this vector by frequency and print.

For example, here is how you could rewrite your loop:

struct word_freq {
    int freq;
    std::string word;

    // members are initialized in declaration order, so list freq first
    word_freq( const std::string &w ) : freq( 0 ), word( w ) {}
};


void addWord( std::vector<word_freq> &v, const std::string &word )
{
     word_freq tmp( word );
     auto p = std::equal_range( v.begin(), v.end(), tmp, 
         []( const word_freq &w1, const word_freq &w2 ) {
             return w1.word < w2.word;
     } );
     if( p.first == p.second )  // not found
         p.first = v.insert( p.second, tmp ); // insert into proper place
     p.first->freq++; // increase freq counter
}

// ......
std::vector<word_freq> words;
string w;
while (inf >> w)
{
    remove_punc(w);
    addWord( words, w );
}
// here your vector is sorted by words, there are no dups, and the counters already have the proper values
// just re-sort it by freq and print
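
That final re-sort could look like this minimal sketch, assuming the words vector and the k value from the fragment above (a lambda comparator is used, leaving ties arbitrary as the problem statement allows; <algorithm> and <iostream> are needed):

std::sort( words.begin(), words.end(),
    []( const word_freq &a, const word_freq &b ) {
        return a.freq > b.freq; // descending frequency
} );

for ( std::size_t i = 0; i < words.size() && i < static_cast<std::size_t>(k); ++i )
    std::cout << words[i].word << " " << words[i].freq << "\n";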

Details on how to keep a vector sorted can be found here: how do you insert the value in a sorted vector?

On the other hand, keeping std::vector<word_freq> sorted will require many inserts into the middle or beginning of the vector, which can be quite expensive and slow. So if you implement the described logic and make it work on small examples, but it is still too slow for your big input, you should sort a vector of indexes instead of the vector of word_freq itself. That will still require inserting at the beginning or middle of a vector of integers, but such an operation is significantly cheaper and faster. Details on how to sort indexes instead of the vector itself can be found here: compare function of sort in c++ for index sort
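
A minimal self-contained sketch of that index-sort idea (the sample data is illustrative; in the real program the vector would be built by addWord above):

#include <algorithm>
#include <cstddef>
#include <iostream>
#include <string>
#include <vector>

struct word_freq {
    int freq;
    std::string word;
};

int main()
{
    // illustrative data; in the real program this vector is built by addWord()
    std::vector<word_freq> words = { { 3, "apple" }, { 7, "banana" }, { 5, "cherry" } };

    // sort a vector of indexes instead of moving the word_freq elements around
    std::vector<std::size_t> idx( words.size() );
    for ( std::size_t i = 0; i < idx.size(); ++i )
        idx[i] = i;

    std::sort( idx.begin(), idx.end(),
        [&words]( std::size_t a, std::size_t b ) {
            return words[a].freq > words[b].freq; // descending frequency
    } );

    for ( std::size_t i : idx )
        std::cout << words[i].word << " " << words[i].freq << "\n";
}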

Slava