
I have a relatively large file that I needed to ensure contained only unique lines. The file is only 500MB. I understand that there is plenty of overhead, but I was seeing nearly 5GB of RAM usage. I could have done this with an external merge sort while keeping RAM usage small, but this seemed faster to code.

I am using VC++14.

#include <string>
#include <vector>
#include <fstream>
#include <iostream>
#include <algorithm>
#include <unordered_set>

using std::vector;
using std::string;
using std::unordered_set;

class uniqify {
    unordered_set<string> s;
public:
    auto exists(const string &filename) const -> bool {
        std::ifstream fin(filename);
        return fin.good();  // the stream closes itself when it goes out of scope
    }

    void read(const string &filename) {
        std::ifstream input(filename);
        string line;
        while (std::getline(input, line))
            if (line.size())
                s.insert(line);
    }

    void write(const string &filename) const {
        std::ofstream fout(filename);
        for (const auto &line : s)  // by reference, to avoid copying each string
            fout << line << "\n";
        fout.close();
    }
};

int main(int argc, char **argv) {
    uniqify u;
    string file("file.txt");
    if(u.exists(file))
        u.read(file);
    u.write("output_file.txt");
    return 0;
}

What causes the RAM usage to balloon to over 10x the size of the file?

– Goodies (question edited by M.M)
  • "*The file is only 500MB.*" You say "only" as though this was a small file. Also, how many lines are in it? – Nicol Bolas May 12 '16 at 04:19
  • Might want to have a look at what's being allocated using your debugger or memory analyser. – tadman May 12 '16 at 04:20
  • At the end of `read()`, print `s.bucket_count()` and `s.size()`. What are the values? You might want to `s.reserve(...something large enough...)` if maximum performance is wanted. – doug65536 May 12 '16 at 04:24
  • @NicolBolas I'd say, if it fits on a CD, it is a small file, these days ;-). And large files are those you can't store on FAT32 USB sticks (4GB). – hyde May 12 '16 at 04:35
  • @NicolBolas 500MB is "only" in the sense that it takes almost ten times as much space in memory. There are around 55,000,000 lines I believe. Each line is roughly 1-20 characters of varied lengths. – Goodies May 12 '16 at 04:48
  • 500MB/55M is 9 bytes on average. Plus 1 `'\0'` terminating the string, so your real data is 10 bytes. Each set node contains at least 3 pointers, which is 24 bytes, plus 1 byte for the colour. For each node, your data type is string, which might take another 24 bytes to manage your real 10 bytes of data. That's 59 bytes, and you also need to take alignment into account, which is roughly 7 (for the colour) + 4 (for each string on average). That's 70 bytes. The memory allocator may also need some additional bytes for bookkeeping. That will end up close to your result. – ZHANG Zikai May 12 '16 at 05:00
  • @ZHANGZikai for an *ordered* set (`std::set`) that would mostly make sense, but that isn't what the OP is using. – WhozCraig May 12 '16 at 05:11
  • To use less memory than `string`, you could read the file into a flat buffer and store pointers into the buffer – M.M May 12 '16 at 05:20
  • @WhozCraig Oops, you are right. Thanks. – ZHANG Zikai May 12 '16 at 05:29
  • For this particular problem, you can try [Trie](https://en.wikipedia.org/wiki/Trie). It merges all duplicated bytes and the memory cost could be less than your input file. It's not as simple to code as this one though. – ZHANG Zikai May 12 '16 at 05:34
  • @M.M I'd still have to compare them, though. Set wouldn't do that afaik. Example: http://cpp.sh/46wg output still shows a size of 2 despite them being identical. I'm not looking for just a container. – Goodies May 12 '16 at 05:41
  • Pipe the file through the well-known Unix tools `sort` and `uniq`. That said, the approach of @ZHANGZikai isn't totally off; the overhead for storing a few bytes in a string is relatively large. Having a fixed-size buffer (as some `std::string` implementations offer as an add-on) would help, and compiling for 32-bit would also help, although the data might prove to be slightly too large for a 32-bit process. – Ulrich Eckhardt May 12 '16 at 06:00
  • @Goodies What MM said, in combination with an in-memory sort and stripping unique values, can certainly be done without too much added code. [something like this](http://pastebin.com/grcHQatH), which should use considerably less memory than you were facing (see also the sketch after these comments). But if it really is mammoth amounts of input data (I don't consider 500MB to be "small"; call me old-school), I would seriously consider using something similar to Ulrich's process-piping suggestion. You're on Windows, so you're somewhat limited in the nifty toolset availability, however. – WhozCraig May 12 '16 at 06:32
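
A rough sketch of the flat-buffer idea from the comments above (an illustration, not code posted by anyone in the thread): read the whole file into one contiguous buffer, keep a lightweight view per line, then sort and drop adjacent duplicates. It reuses the file names from the question and relies on std::string_view, which needs C++17; on the VC++ 2015 toolchain from the question, pointer/length pairs would serve the same purpose.

#include <algorithm>
#include <fstream>
#include <iterator>
#include <string>
#include <string_view>
#include <vector>

int main() {
    // Slurp the whole file into one contiguous buffer (~500MB here).
    std::ifstream in("file.txt", std::ios::binary);
    const std::string buffer((std::istreambuf_iterator<char>(in)),
                             std::istreambuf_iterator<char>());

    // Each line becomes a 16-byte view into the buffer instead of a
    // separately allocated node plus its own std::string.
    std::vector<std::string_view> lines;
    std::string_view rest(buffer);
    while (!rest.empty()) {
        const auto pos = rest.find('\n');
        std::string_view line = rest.substr(0, pos);
        if (!line.empty() && line.back() == '\r')   // tolerate CRLF line endings
            line.remove_suffix(1);
        if (!line.empty())
            lines.push_back(line);
        if (pos == std::string_view::npos)
            break;
        rest.remove_prefix(pos + 1);
    }

    // Sorting and erasing adjacent duplicates replaces the hash set entirely.
    std::sort(lines.begin(), lines.end());
    lines.erase(std::unique(lines.begin(), lines.end()), lines.end());

    std::ofstream out("output_file.txt");
    for (const auto &line : lines)
        out << line << '\n';
}

Peak memory is then roughly the file itself plus 16 bytes per line; the output comes out sorted, but the original unordered_set version did not guarantee any particular order either.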

2 Answers


An unordered_set is a node-based container. Last time I checked, MSVC uses a doubly linked list to store the elements, and a vector of iterators into that linked list to delineate the buckets. The default max_load_factor() of unordered_set is 1, so there are at least as many buckets as nodes, and it stores roughly one list iterator (which is one pointer) per bucket. So for each node you have two pointers' worth of overhead from the doubly linked list, plus at least one pointer from the bucket, for three pointers in total.

Then std::string adds its own overhead on top. MSVC's std::string is, I believe, two pointers plus a 16-byte SSO buffer. Strings longer than 15 characters will use dynamic allocation, which costs more.

So each string in the set costs at least 5 pointers plus a 16-byte SSO buffer; at 8 bytes per pointer, that's 56 bytes per string minimum. With 55M strings that's about 3GB right there. And we haven't counted strings longer than 15 characters, nor the per-node memory allocation overhead, which can easily bring the total up to 5GB.
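
Those per-object figures are easy to sanity-check. A minimal probe along these lines prints the implementation-specific constants the estimate relies on; the values in the comments are what one would typically expect from a 64-bit MSVC release build, not guarantees.

#include <iostream>
#include <string>
#include <unordered_set>

int main() {
    std::unordered_set<std::string> s;

    // Per-object sizes the estimate is built from; both are implementation-specific
    // (typically 32 and 8 on a 64-bit MSVC release build).
    std::cout << "sizeof(std::string): " << sizeof(std::string) << '\n';
    std::cout << "sizeof(void*):       " << sizeof(void*) << '\n';

    // The default max load factor is 1.0, so bucket_count() ends up >= size().
    std::cout << "max_load_factor():   " << s.max_load_factor() << '\n';

    for (int i = 0; i < 1'000'000; ++i)
        s.insert(std::to_string(i));

    std::cout << "size():              " << s.size() << '\n';
    std::cout << "bucket_count():      " << s.bucket_count() << '\n';
}

Multiplying the resulting per-element cost by the roughly 55 million lines mentioned in the comments reproduces the multi-gigabyte total.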

– T.C. (answer edited by Ulrich Eckhardt)

There is overhead involved with these data structures regardless of which implementation your C++ compiler vendor provides.

If you follow the discussion in this question and others of a similar nature, you will find that most vendors implement the unordered set with a hash table, and hash tables need to be resized and grow in awkward ways when a significant number of entries is added dynamically. You should allocate the table at the right size up front rather than counting on dynamic resizing.

However, this is just a guess since I don't know what implementation is used in your system.
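
To make the size-it-up-front advice concrete, here is a minimal sketch of the reading loop with a reservation added; the 60 million figure is only an assumed upper bound on the line count (the comments mention roughly 55 million), not a measured value.

#include <fstream>
#include <string>
#include <unordered_set>
#include <utility>

int main() {
    std::unordered_set<std::string> s;
    // Assumed upper bound on the number of lines; reserving buckets up front
    // avoids the repeated rehashing described above.
    s.reserve(60'000'000);

    std::ifstream input("file.txt");
    std::string line;
    while (std::getline(input, line))
        if (!line.empty())
            s.insert(std::move(line));  // moving in avoids one extra copy per unique line
}

Note that reserving mainly smooths out the insertion phase by avoiding repeated bucket-array reallocations and rehashes; it does not reduce the per-node cost described in the other answer.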

– Soren