
I need to count the occurrences of the string "<page>" in a 104 GB file to get the number of articles in a given Wikipedia dump. First, I tried this:

grep -F '<page>' enwiki-20141208-pages-meta-current.xml | uniq -c

However, grep crashes after a while, so I wrote the following program. It only processes about 20 MB/s of the input file on my machine, which keeps my HDD at only about 5% utilization. How can I speed up this code?

#include <iostream>
#include <fstream>
#include <string>

int main()
{
    // Open up file
    std::ifstream in("enwiki-20141208-pages-meta-current.xml");
    if (!in.is_open()) {
        std::cout << "Could not open file." << std::endl;
        return 0;
    }
    // Statistics counters
    size_t chars = 0, pages = 0;
    // Token to look for
    const std::string token = "<page>";
    size_t token_length = token.length();
    // Read one char at a time
    size_t matching = 0;
    while (in.good()) {
        // Read one char at a time
        char current;
        in.read(&current, 1);
        if (in.eof())
            break;
        chars++;
        // Continue matching the token
        if (current == token[matching]) {
            matching++;
            // Reached full token
            if (matching == token_length) {
                pages++;
                matching = 0;
                // Print progress
                if (pages % 1000 == 0) {
                    std::cout << pages << " pages, ";
                    std::cout << (chars / 1024 / 1024) << " mb" << std::endl;
                }
            }
        }
        // Start over again
        else {
            matching = 0;
        }
    }
    // Print result
    std::cout << "Overall pages: " << pages << std::endl;
    // Cleanup
    in.close();
    return 0;
}
danijar
  • Not the problem but you should read ["Why is iostream::eof inside a loop condition considered wrong?"](http://stackoverflow.com/questions/5605125/why-is-iostreameof-inside-a-loop-condition-considered-wrong). – Captain Obvlious Dec 30 '14 at 00:02
  • Reading a line at a time might be better than a character at a time. I assume you're testing this in an optimized build, if not you should be. – Retired Ninja Dec 30 '14 at 00:04
  • @CaptainObvlious Thanks, fixed. – danijar Dec 30 '14 at 00:04
  • @RetiredNinja Wouldn't this be the same since reading a line internally requires looking at each char to find the `\n` character? – danijar Dec 30 '14 at 00:06
  • @danijar I've always gotten better results reading larger chunks at a time. It may or may not help in your situation, and I don't have a spare 140gb file laying around to test it on. – Retired Ninja Dec 30 '14 at 00:10
  • Have you tried `split` then `fgrep`? – quantdev Dec 30 '14 at 00:12
  • I would suggest @quantdev's solution before trying to write something like this by hand. Getting speedups with handwritten code requires intimate knowledge of how to maximize streaming performance from spinning drives. – BlamKiwi Dec 30 '14 at 00:32
  • Your bottleneck should be the file input. To increase performance, more data should be read into memory before searching. You may get some extra performance by having one thread read data into a buffer while another thread searches it. – Thomas Matthews Dec 30 '14 at 00:39
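
As a rough sketch of the idea in the last comment (reading larger blocks and overlapping the disk reads with the search), something like the following could work. The 16 MB chunk size, the std::async reader, and the helper name countToken are assumptions for illustration, not anything suggested verbatim in the comments:

#include <cstddef>
#include <fstream>
#include <functional>
#include <future>
#include <iostream>
#include <string>
#include <vector>

// Count occurrences of token in data[0..n), carrying the partial-match
// state across calls so a token split between two chunks is still found.
static std::size_t countToken(const char* data, std::size_t n,
                              const std::string& token, std::size_t& matching)
{
    std::size_t found = 0;
    for (std::size_t i = 0; i < n; ++i) {
        if (data[i] == token[matching]) {
            if (++matching == token.size()) {
                ++found;
                matching = 0;
            }
        } else {
            // Restart, but re-check the current character against the
            // first token character (handles input like "<<page>").
            matching = (data[i] == token[0]) ? 1 : 0;
        }
    }
    return found;
}

int main()
{
    std::ifstream in("enwiki-20141208-pages-meta-current.xml", std::ios::binary);
    if (!in) {
        std::cout << "Could not open file." << std::endl;
        return 1;
    }
    const std::string token = "<page>";
    const std::size_t chunkSize = 16 * 1024 * 1024;   // assumption: 16 MB chunks
    std::vector<char> current(chunkSize), next(chunkSize);

    // Read one chunk and report how many bytes were actually read.
    auto readChunk = [&in](std::vector<char>& buf) {
        in.read(buf.data(), static_cast<std::streamsize>(buf.size()));
        return static_cast<std::size_t>(in.gcount());
    };

    std::size_t pages = 0, matching = 0;
    std::size_t got = readChunk(current);
    while (got > 0) {
        // Start reading the next chunk while this thread scans the current one.
        auto pending = std::async(std::launch::async, readChunk, std::ref(next));
        pages += countToken(current.data(), got, token, matching);
        got = pending.get();   // wait for the reader before swapping buffers
        current.swap(next);
    }
    std::cout << "Overall pages: " << pages << std::endl;
}

On a single spinning disk the overlap may buy little, since the drive remains the bottleneck either way; the main point is that reads happen in large blocks and the matching state survives chunk boundaries.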

2 Answers


Assuming there are no insanely large lines in the file, using something like

for (std::string line; std::getline(in, line); ) {
    // find the number of "<page>" strings in line
}

is bound to be a lot faster! Reading the file one character at a time is about the worst thing you can possibly do. It is really hard to get any slower. For each character, the stream will do something like this:

  1. Check if there is a tie()ed stream which needs flushing (there isn't, i.e., that's pointless).
  2. Check if the stream is in good shape (it is, except once the end has been reached, but this check can't be omitted entirely).
  3. Call xsgetn() on the stream's stream buffer.
  4. This function first checks if there is another character in the buffer (that's similar to the eof check but different; in any case, doing the eof check only once the buffer is empty removes a lot of the eof checks).
  5. Transfer the character to the read buffer.
  6. Have the stream check whether it read all (here: 1) characters and set the stream flags as needed.

There is a lot of waste in there!
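
As an aside that makes the list above concrete: most of that per-character work lives in the stream layer, not in the buffer itself. The following minimal sketch (an illustration, not part of the original answer) pulls characters straight from the stream buffer with sbumpc(), which skips the sentry, tie() and flag bookkeeping of istream::read() while still going character by character. The line- and block-based approaches remain the better fix:

#include <cstddef>
#include <fstream>
#include <iostream>
#include <streambuf>
#include <string>

int main()
{
    std::ifstream in("enwiki-20141208-pages-meta-current.xml");
    if (!in) {
        std::cout << "Could not open file." << std::endl;
        return 1;
    }
    const std::string token = "<page>";
    std::size_t pages = 0, matching = 0;
    std::streambuf* buf = in.rdbuf();
    // Pull characters straight from the stream buffer; end-of-file is only
    // signalled through the returned value, not through stream flags.
    for (std::streambuf::int_type c = buf->sbumpc();
         c != std::streambuf::traits_type::eof();
         c = buf->sbumpc()) {
        const char ch = std::streambuf::traits_type::to_char_type(c);
        if (ch == token[matching]) {
            if (++matching == token.size()) {   // matched a full "<page>"
                ++pages;
                matching = 0;
            }
        } else {
            matching = (ch == token[0]) ? 1 : 0;   // restart, re-checking '<'
        }
    }
    std::cout << "Overall pages: " << pages << std::endl;
}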

I can't really imagine why grep would fail unless some line blows massively past the expected maximum line length. Although std::getline() with a std::string is likely to have a much bigger upper bound, it is still not efficient to process huge lines. If the file may contain massive lines, it may be more reasonable to use something along these lines:

for (std::istreambuf_iterator<char> it(in), end;
     (it = std::find(it, end, '<')) != end; ) {
    // match "<page>" at the start of the sequence [it, end)
}

For a bad implementation of streams that's still doing too much. Good implementations will do the calls to std::find(...) very efficiently and will probably check multiple characters at once, adding a check and loop only for something like every 16th loop iteration. I'd expect the above code to turn your CPU-bound implementation into an I/O-bound implementation. A bad implementation may still be CPU-bound, but it should still be a lot better.
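
For reference, here is a sketch of how the matching step in that loop could be filled in. The loop body is an addition for illustration; everything else mirrors the snippet above:

#include <algorithm>
#include <cstddef>
#include <fstream>
#include <iostream>
#include <iterator>
#include <string>

int main()
{
    std::ifstream in("enwiki-20141208-pages-meta-current.xml");
    const std::string token = "<page>";
    std::size_t pages = 0;
    for (std::istreambuf_iterator<char> it(in), end;
         (it = std::find(it, end, '<')) != end; ) {
        ++it;                  // consume the '<' that std::find located
        std::size_t i = 1;     // token[0] == '<' is already matched
        while (i < token.size() && it != end && *it == token[i]) {
            ++it;
            ++i;
        }
        if (i == token.size())   // the whole "<page>" token was present
            ++pages;
        // On a mismatch, 'it' still points at the offending character, so a
        // '<' there is picked up again by std::find in the next iteration.
    }
    std::cout << "Overall pages: " << pages << std::endl;
}

Whether this ends up I/O-bound depends, as noted above, on how well the standard library optimizes std::find over std::istreambuf_iterator.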

In any case, remember to enable optimizations!

Dietmar Kühl
  • Thank you, reading lines is a huge improvement. I get 115 MB/s now. However, this still results in just 25% HDD active time. What do you mean by massive lines? I believe that whole Wikipedia articles are stored in a single line. – danijar Dec 30 '14 at 01:07
  • When a line is bigger than the available caches, it will be read into the cache, evicted, and brought back to look for the string. That's not great if it sometimes happens, but it is bad if it happens frequently. If the line is even bigger than the available main memory, it will be evicted from memory and read multiple times. For lines with a typical length of a few kB that's probably OK; for lines of a few MB it is probably already bad. – Dietmar Kühl Dec 30 '14 at 01:12
  • Can't I just read fixed-size chunks so that line length doesn't matter anymore? I'd be fine reading through `\n` like any other character. – danijar Dec 30 '14 at 01:21
  • Sure, you can read fixed-size chunks. Remember, however, that you need to check for a `<page>` that is split across two chunks. Also, doing so will need to copy the buffer to a new location. Using `std::find()` with `std::istreambuf_iterator` can locate the start character in the stream buffer's buffer, avoiding an extra copy. – Dietmar Kühl Dec 30 '14 at 01:25
  • Great, checking for tokens that are split across two chunks is no problem with my algorithm. I just have to keep the value of `matching` between chunks. Fixed-size chunks give me 140 MB/s and 90% disk active time. I think that's as far as I can get, or do you have another idea? – danijar Dec 30 '14 at 01:51
  • I'd still go with the use of `std::istreambuf_iterator` and use an implementation which has this operation optimized. If necessary, I'd just use my own implementation of IOStreams and optimize that operation (of course, I may have an advantage as I have already implemented my own IOStreams library...). – Dietmar Kühl Dec 30 '14 at 02:34

I'm using this file to test with: http://dumps.wikimedia.org/enwiki/latest/enwiki-latest-pages-meta-current1.xml-p000000010p000010000.bz2

It takes roughly 2.4 seconds versus 11.5 using your code. The total character count is slightly different due to not counting newlines, but I assume that's acceptable since it's only used to display progress.

void parseByLine()
{
    // Open up file
    std::ifstream in("enwiki-latest-pages-meta-current1.xml-p000000010p000010000");
    if(!in)
    {
        std::cout << "Could not open file." << std::endl;
        return;
    }
    size_t chars = 0;
    size_t pages = 0;
    const std::string token = "<page>";

    std::string line;
    while(std::getline(in, line))
    {
        chars += line.size();
        size_t pos = 0;
        for(;;)
        {
            pos = line.find(token, pos);
            if(pos == std::string::npos)
            {
                break;
            }
            pos += token.size();
            if(++pages % 1000 == 0)
            {
                std::cout << pages << " pages, ";
                std::cout << (chars / 1024 / 1024) << " mb" << std::endl;
            }
        }
    }
    // Print result
    std::cout << "Overall pages: " << pages << std::endl;
}

Here's an example that adds each line to a buffer and then processes the buffer when it reaches a threshold. It takes 2 seconds versus ~2.4 from the first version. I played with several different thresholds for the buffer size and also processing after a fixed number (16, 32, 64, 4096) of lines and it all seems about the same as long as there is some batching going on. Thanks to Dietmar for the idea.

int processBuffer(const std::string& buffer)
{
    static const std::string token = "<page>";

    int pages = 0;
    size_t pos = 0;
    for(;;)
    {
        pos = buffer.find(token, pos);
        if(pos == std::string::npos)
        {
            break;
        }
        pos += token.size();
        ++pages;
    }
    return pages;
}

void parseByMB()
{
    // Open up file
    std::ifstream in("enwiki-latest-pages-meta-current1.xml-p000000010p000010000");
    if(!in)
    {
        std::cout << "Could not open file." << std::endl;
        return;
    }
    const size_t BUFFER_THRESHOLD = 16 * 1024 * 1024;
    std::string buffer;
    buffer.reserve(BUFFER_THRESHOLD);

    size_t pages = 0;
    size_t chars = 0;
    size_t progressCount = 0;

    std::string line;
    while(std::getline(in, line))
    {
        buffer += line;
        if(buffer.size() > BUFFER_THRESHOLD)
        {
            pages += processBuffer(buffer);
            chars += buffer.size();
            buffer.clear();
        }
        if((pages / 1000) > progressCount)
        {
            ++progressCount;
            std::cout << pages << " pages, ";
            std::cout << (chars / 1024 / 1024) << " mb" << std::endl;
        }
    }
    if(!buffer.empty())
    {
        pages += processBuffer(buffer);
        chars += buffer.size();
        std::cout << pages << " pages, ";
        std::cout << (chars / 1024 / 1024) << " mb" << std::endl;
    }
}
Retired Ninja