
I have a file with roughly 2 million lines like this:

2s,3s,4s,5s,6s 100000
2s,3s,4s,5s,8s 101
2s,3s,4s,5s,9s 102

The first comma-separated part indicates a poker result in Omaha, while the score that follows is an example "value" of the cards. It is very important for me to read this file as fast as possible in C++, but I cannot seem to get it faster than a simple approach in Python (4.5 seconds) using only the standard library.

Using the Qt framework (QHash and QString), I was able to read the file in 2.5 seconds in release mode. However, I do not want the Qt dependency. The goal is to allow quick simulations using those 2 million lines, i.e. some_container["2s,3s,4s,5s,6s"] should yield 100000 (though if applying a translation function or using a non-human-readable format allows for faster reading, that is fine as well).

My current implementation is extremely slow (8 seconds!):

#include <fstream>
#include <map>
#include <string>

std::map<std::string, int> get_file_contents(const char *filename)
{
    std::map<std::string, int> outcomes;
    std::ifstream infile(filename);

    std::string c;
    int d;

    // Test the extraction itself rather than infile.good();
    // checking good() before reading duplicates the last record at EOF.
    while (infile >> c >> d)
    {
        outcomes[c] = d;
    }
    return outcomes;
}

What can I do to read this data into some kind of a key/value hash as fast as possible?

Note: The first 14 characters (the cards) are always going to be there, while the score can go up to around 1 million.

Some further information gathered from various comments:

  • sample file: http://pastebin.com/rB1hFViM
  • RAM restriction: 750 MB
  • initialization time restriction: 5s
  • computation time per hand restriction: 0.5s
  • Why are you storing this data as text if quick access is such an important issue? – Kerrek SB May 28 '14 at 22:05
  • @KerrekSB I have no idea about alternatives unfortunately. Note that it is for a friendly competition; the solution has to be standalone and cannot be connected to a database. – PascalVKooten May 28 '14 at 22:06
  • Could you upload a representative file somewhere? – Kerrek SB May 28 '14 at 22:08
  • Can you afford to do a (potentially slow) setup first, or does the entire implementation have to run fast? std::map works by keeping its data sorted internally, maybe not the best container for your needs. – Matt Coubrough May 28 '14 at 22:10
  • So, the file generation is out of your control? The fastest data structure that you can use in this case is an unordered_map with pre-allocated buckets. – Ian Medeiros May 28 '14 at 22:10
  • @Ian No, I can use any kind of file though. – PascalVKooten May 28 '14 at 22:13
  • Store the data in a fixed width binary representation if you can. – Kerrek SB May 28 '14 at 22:16
  • @MattCoubrough We're talking about a poker bot (no worries, it's concerning a competition like I said). I'll have 5 seconds setup time, and then 0.5 seconds per move to make, so I won't have a lot of options with respect to slow at first. – PascalVKooten May 28 '14 at 22:16
  • @KerrekSB An example file containing more examples: http://pastebin.com/rB1hFViM – PascalVKooten May 28 '14 at 22:19
  • @KerrekSB I didn't study CS; any pointers (pun intended) on how to store this kind of data as a "fixed width binary representation"? – PascalVKooten May 28 '14 at 22:22
  • @PascalvKooten: No need to study CS to be a programmer. I recommend watching [this video series](http://www.youtube.com/playlist?list=PLHxtyCq_WDLXFAEA-lYoRNQIezL_vaSX-). – Kerrek SB May 28 '14 at 22:30
  • OK, let's set this straight: You have 52 cards, so you want a map from "5-small-integers" to `int`. – Kerrek SB May 28 '14 at 22:43
  • So in the very naive setup, you could store one card per byte, for 40 bits in total, leaving you another 24 bits for the score, so you have 64-bit records. If you want to get tighter, you only need 6 bits per card really, so 30 bits, plus however much you need for the score. Tighter yet, you can work out that there are only 52-choose-5 hands, which is 2598960. If you enumerate every hand, you only need 22 bits to store that, and if the score fits in 10 bits, you can get away with 32-bit records. Then your 2M samples would fit into 8MB of memory. – Kerrek SB May 28 '14 at 22:51
  • @KerrekSB It sounds really great! I just have no idea how to do it. Even already knowing how to do those calculations is great. Are these topics handled in those videos? – PascalVKooten May 28 '14 at 23:00
  • @PascalvKooten: I doubt it (though I haven't made it through all 15 hours yet) - the videos are more of a philosophical discovery that it's "OK to be a programmer" and socially acceptable to know maths... you'll probably have to do some learning yourself. There are plenty of good books, though, many of which are mentioned in the videos. – Kerrek SB May 28 '14 at 23:07
  • @PascalvKooten if you want answers more complex than the ones that were provided, I suggest you offer a big bounty. It's a cool problem to solve, but it's not as trivial as you think. – Ian Medeiros May 30 '14 at 14:08

3 Answers


As I see it, there are two bottlenecks in your code.

First bottleneck

I believe that the file reading is the biggest problem there. Having a binary file is the fastest option. Not only can you read it directly into an array with a raw istream::read in a single operation (which is very fast), but you can even map the file into memory if your OS supports it. Here is a link that's very informative on how to use memory mapped files.
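
For illustration, a minimal sketch of the bulk-read approach, assuming the file has already been converted to packed fixed-width records (the Record layout is an assumption, not something the question prescribes):

#include <cstdint>
#include <fstream>
#include <vector>

// Hypothetical fixed-width record: encoded hand key plus score (8 bytes, no padding).
struct Record
{
    std::uint32_t hand;
    std::int32_t  score;
};

std::vector<Record> read_records(const char *filename)
{
    std::ifstream in(filename, std::ios::binary | std::ios::ate);
    std::streamsize size = in.tellg();           // opened at the end: tellg() is the file size
    in.seekg(0, std::ios::beg);

    std::vector<Record> records(size / sizeof(Record));
    in.read(reinterpret_cast<char *>(records.data()), size);  // one bulk read
    return records;
}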


Second bottleneck

The std::map is usually implemented as a self-balancing BST that stores all the data in order. This makes insertion an O(log n) operation. You can change it to std::unordered_map, which uses a hash table instead. A hash table has constant-time insertion if the number of collisions is low. Since the number of elements you need to read is known, you can reserve a suitable number of buckets before inserting the elements. Keep in mind that you need more buckets than elements to be inserted to keep the collision rate low.
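
A minimal sketch of the pre-sized hash table, assuming the 2 million line count from the question:

#include <string>
#include <unordered_map>

std::unordered_map<std::string, int> make_outcomes()
{
    std::unordered_map<std::string, int> outcomes;
    outcomes.max_load_factor(0.5f);   // more buckets than elements -> fewer collisions
    outcomes.reserve(2000000);        // pre-allocate for the known element count
    return outcomes;
}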

  • Thanks. I'm just wondering how you read in a file at once given that it should be read in a key,value combination? It doesn't sound like a vector has anything to do with it? – PascalVKooten May 28 '14 at 22:39
  • Map it to memory and populate the unordered_map from there. The disk->memory transfer is your biggest problem. By this I mean: you will read the content from an array in memory, or from a std::istream directly. Using the std::ifstream to parse the text will be really slow. – Ian Medeiros May 28 '14 at 22:40

A simple idea might be to use the C API, which is considerably simpler:

#include <cstdio>
#include <string>
#include <unordered_map>

int main()
{
    std::unordered_map<std::string, int> outcomes;

    int n;
    char s[128];

    // fscanf returns the number of fields converted; 2 means a full record
    while (std::fscanf(stdin, "%127s %d", s, &n) == 2)
    {
        outcomes[s] = n;
    }
}

A rough test showed a considerable speedup for me compared to the iostreams library.

Further speedups may be achieved by storing the data in a contiguous array, e.g. a vector of std::pair<std::string, int>; it depends on whether your data is already sorted and how you need to access it later.
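
A minimal sketch of that contiguous layout, assuming the data is kept sorted by key so lookups can use binary search (the Entry alias and lookup helper are illustrative):

#include <algorithm>
#include <string>
#include <utility>
#include <vector>

using Entry = std::pair<std::string, int>;

// Binary search over a vector kept sorted by key; returns -1 if absent.
int lookup(const std::vector<Entry> &table, const std::string &key)
{
    auto it = std::lower_bound(table.begin(), table.end(), key,
                               [](const Entry &e, const std::string &k)
                               { return e.first < k; });
    if (it != table.end() && it->first == key)
        return it->second;
    return -1;
}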

For a serious solution, though, you should probably step back further and think of a better way to represent your data. For example, a fixed-width, binary encoding would be much more space-efficient and faster to parse, since you won't need to look ahead for line endings or parse strings.
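
As a one-time preprocessing step, a sketch of such a conversion; the card numbering (rank * 4 + suit) and the Record layout are assumptions for illustration, not something the question prescribes:

#include <cstdint>
#include <cstdio>
#include <cstring>

// Hypothetical fixed-width record: encoded hand key plus score.
struct Record
{
    std::uint32_t hand;
    std::int32_t  score;
};

// Hypothetical encoding: rank * 4 + suit per card, combined base-52.
// Assumes the cards within a line are already in a canonical order.
std::uint32_t encode_hand(const char *s)
{
    static const char *ranks = "23456789TJQKA";
    static const char *suits = "cdhs";
    std::uint32_t key = 0;
    for (int i = 0; i < 5; ++i, s += 3)       // each card is 2 chars plus a comma
    {
        int rank = std::strchr(ranks, s[0]) - ranks;
        int suit = std::strchr(suits, s[1]) - suits;
        key = key * 52 + (rank * 4 + suit);
    }
    return key;
}

void convert(std::FILE *in, std::FILE *out)
{
    char cards[32];
    int score;
    while (std::fscanf(in, "%31s %d", cards, &score) == 2)
    {
        Record r = { encode_hand(cards), score };
        std::fwrite(&r, sizeof r, 1, out);    // fixed-width binary record
    }
}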

Update: From some quick experimentation I've found it fairly fast to first read the entire file into memory and then perform alternating strtok calls with either " " or "\n" as the delimiter; whenever a pair of calls succeeds, apply strtol on the second pointer to parse the integer. Here's a skeleton:

#include <cerrno>
#include <cstdio>
#include <cstdlib>
#include <cstring>
#include <vector>

int main()
{
    std::vector<char> data;

    // Read entire file to memory
    {
        data.reserve(100000000);

        char buf[4096];
        for (std::size_t n; (n = std::fread(buf, 1, sizeof buf, stdin)) > 0; )
        {
            data.insert(data.end(), buf, buf + n);
        }
        data.push_back('\0');
    }

    // Tokenize the in-memory data
    char * p = &data.front();
    for (char * q = std::strtok(p, " "); q; q = std::strtok(nullptr, " "))
    {
        if (char * r = std::strtok(nullptr, "\n"))
        {
            char * e;
            errno = 0;
            int const n = std::strtol(r, &e, 10);
            if (*e != '\0' || errno != 0) { continue; }

            // At this point we have data:
            // * the string is "q"
            // * the integer is "n"
        }
    }
}
  • Yes a binary representation in the file that is already in a searchable and sorted format would be the way to go (if permissible). NB. If you do end up using vector make sure you reserve() the size required, as the resizing of the vector requires lots of data shifting and is a performance killer. – Matt Coubrough May 28 '14 at 22:30
  • To extend on the goal: I will run simulations later on where I will know my cards, and I'll sample many random enemy hands (subtracting the cards that I hold already) to determine who has the higher score given the cards on the table. I am not sure if sorting the values here will help, as depending on which cards I have, somewhere in the vector there would be gaps? – PascalVKooten May 28 '14 at 22:33
  • Also, I'm a bit confused by a vector of pairs. Searching a vector for the key will be slow? – PascalVKooten May 28 '14 at 22:40
  • @PascalvKooten: Well, storing data in a map is slower than in a vector, but you get it in sorted order with fast lookups. By contrast, sorting a vector with millions of elements by string comparison may be expensive. So choose whichever representation suits your access pattern best. – Kerrek SB May 28 '14 at 22:41
  • @PascalvKooten: With an unsorted vector, you have to do a linear search, and since you're doing string comparison, that touches a lot of random memory. (Of course the real solution is not to have a string key.) – Kerrek SB May 28 '14 at 22:42
  • Does it suffice to just open the file and then write to a binary format? – PascalVKooten May 28 '14 at 22:47
  • You could allocate a 100MB buffer and read in the entire file in binary mode (to avoid any conversion overhead). You'd then have to parse the buffer, but the I/O overhead would be reduced. Depending on the OS, you can determine the file size before reading it, but 100MB should be enough (50 bytes per line), unless the file is unicode, in which case allocate 200MB. – rcgldr May 28 '14 at 23:23
  • @rcgldr: Hm, I [came across this problem](http://stackoverflow.com/questions/23923924/why-is-glibc-sscanf-vastly-slower-than-fscanf-on-linux)... – Kerrek SB May 29 '14 at 00:29
  • The issue here is sscanf is being used to search for newlines. Instead use memchr() to find newlines, and only use sscanf to translate numeric strings into integers (or perhaps atoi()). – rcgldr May 29 '14 at 01:34
  • @rcgldr: I think the issue is a bit more subtle. In principle, `sscanf` only needs to scan forward as long as it has to do conversions; that's the same cost no matter which approach you take. – Kerrek SB May 29 '14 at 08:46
  • The issue in the link is that sscanf was searching for a null character in a large buffer. Using memchr to find newlines and replacing them with null characters (processing one line at a time) should resolve that issue. – rcgldr May 29 '14 at 10:01
  • @rcgldr: right, but that's problem of that particular implementation. It's not the scan that searches for the terminator, but the ambient string-to-file wrapper. – Kerrek SB May 29 '14 at 10:45

Ian Medeiros already mentioned the two major bottlenecks.

A few thoughts about data structures:

The number of different cards is known: 4 suits of 13 cards each -> 52 cards, so a card requires less than 6 bits to store. Your current file format uses 24 bits per card (including the comma), so by simply enumerating the cards and omitting the commas you save about 2/3 of the file size and can determine a card by reading only one character per card. If you want to keep the file text-based, you could use a-m, n-z, A-M and N-Z for the four suits.
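
A minimal sketch of that one-character scheme (the suit/rank numbering is an assumption):

// One printable character per card: a-m, n-z, A-M, N-Z for the four suits.
char encode_card(int suit /* 0..3 */, int rank /* 0..12 */)
{
    static const char base[4] = { 'a', 'n', 'A', 'N' };
    return base[suit] + rank;
}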

Another thing that bugs me is the string-based map: string operations are inefficient. One hand contains 5 cards, which means 52^5 possibilities if we keep it simple and do not consider the already drawn cards.

--> 52^5 = 380,204,032 < 2^32

That means we can enumerate every possible hand with a uint32 number. By defining a fixed sorting scheme for the cards (since order is irrelevant), we can assign a number to the hand and use this number as the key in our map, which is a lot faster than using strings.
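
A minimal sketch of such a key function (the card numbering 0..51 is an assumption):

#include <algorithm>
#include <array>
#include <cstdint>

// Base-52 positional key over the sorted hand; fits because 52^5 < 2^32.
std::uint32_t hand_key(std::array<std::uint8_t, 5> cards)   // each card in 0..51
{
    std::sort(cards.begin(), cards.end());   // canonical order, since order is irrelevant
    std::uint32_t key = 0;
    for (std::uint8_t c : cards)
        key = key * 52 + c;
    return key;
}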

If we have enough memory (about 1.5 GB for 52^5 4-byte cells), we do not even need a map: we can simply use an array. Of course most cells are unused, but access can be very fast. We can even omit the sorting of the cards, since the cells exist whether we fill them or not; in that case, though, you must not forget to fill all possible permutations of each hand read from the file.
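
A sketch of that direct-addressed table, reusing the hypothetical hand_key above:

#include <cstdint>
#include <vector>

// 52^5 cells at 4 bytes each is roughly 1.5 GB; -1 marks an unused cell.
std::vector<std::int32_t> scores(380204032, -1);

// Store:   scores[hand_key(cards)] = value;
// Look up: std::int32_t v = scores[hand_key(cards)];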

With this scheme we may also be able to further optimize file reading speed: if we store only the hand's number and the rating, only 2 values need to be parsed.

In fact, we can reduce the required storage space further by using a more complex addressing scheme for the different hands, since in reality there are only 52*51*50*49*48 = 311,875,200 ordered hands (and, as mentioned, the ordering is irrelevant), but I think this saving is not worth the increased complexity of the hand encoding.

  • Very good breakdown of the whole problem. I think that if he does his research around all the given answers he will get a very good implementation. – Ian Medeiros May 29 '14 at 13:21
  • Sorry to mention this afterwards, but for the whole implementation I'm limited to 750 MB RAM for running the whole program. So let's consider 600 MB RAM to be the max. – PascalVKooten May 29 '14 at 14:25