
I'm trying to read the data contained in a .dat file of size ~1.1 GB. Because I'm doing this on a 16 GB RAM machine, I thought it wouldn't be a problem to read the whole file into memory at once and only process it afterwards.

To do this, I employed the slurp function found in this SO answer. The problem is that the code sometimes, but not always, throws a `bad_alloc` exception. Looking at the Task Manager, I see that there are always at least 10 GB of free memory available, so I don't see how memory could be the issue.

Here is the code that reproduces the error:

#include <iostream>
#include <fstream>
#include <sstream>
#include <string>

using namespace std;

int main()
{
    ifstream file;
    file.open("big_file.dat");
    if(!file.is_open())
        cerr << "The file was not found\n";

    stringstream sstr;
    sstr << file.rdbuf();
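    // Note: the bad_alloc, when it occurs, is thrown on the next line,
    // where the buffered data is copied into the string (see the comments below).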
    string text = sstr.str();

    cout << "Successfully read file!\n";
    return 0;
}

What could be causing this problem? And what are the best practices to avoid it?

glS
  • Where exactly does it throw the `bad_alloc`? – mindriot Feb 06 '16 at 11:44
  • Are you building a 32 bit executable or a 64 bit executable? – Paul R Feb 06 '16 at 11:45
  • [This SO post](http://stackoverflow.com/a/3486163/3233921) features a small program to find the largest allocatable contiguous block of memory (see the sketch after these comments). This can be less than the actual amount of memory you have. – mindriot Feb 06 '16 at 11:49
  • @mindriot it throws the `bad_alloc` on the line where the data is assigned to the `string` variable. Thanks for the link, I'll check it out. – glS Feb 06 '16 at 11:52
  • @PaulR I'm using the "standard" MinGW distribution (*not* MinGW-w64), so I guess I'm building a 32 bit executable. I thought this wouldn't be a problem in this case, though... is it? – glS Feb 06 '16 at 11:54
  • You would do better by splitting up the contents of the file into smaller pieces instead of trying to keep the whole file in one contiguous memory block, e.g. in several `std::array<>` – AndersK Feb 06 '16 at 11:56
  • `string text = sstr.str();` duplicates the entire block in memory, which is extremely inefficient if you're reading a gigabyte at a time. Consider not using a second stream for this purpose. – M.M Feb 06 '16 at 12:02
  • @glS, yes, yes it is. 32 bit processes can only have 2 GB of memory space, and you need twice your file size. This cannot really work, and the fact that you say it sometimes works indicates a bug in MinGW/Windows/MinGW's glibc. – Marcus Müller Feb 06 '16 at 14:10
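
For reference, here is a minimal sketch of the idea behind the linked post: binary-search for the largest contiguous block that `new` will currently hand out. This is an illustration of the technique, not the code from the linked answer, and on systems that overcommit memory the reported number can be optimistic.

#include <cstddef>
#include <iostream>
#include <limits>
#include <new>

int main()
{
    // Binary search for the largest block that new(std::nothrow) will currently give us.
    std::size_t lo = 0;
    std::size_t hi = std::numeric_limits<std::size_t>::max() / 2;
    while (lo + 1 < hi)
    {
        std::size_t mid = lo + (hi - lo) / 2;
        char* p = new (std::nothrow) char[mid];
        if (p) { delete[] p; lo = mid; }   // allocation succeeded, try bigger
        else   { hi = mid; }               // allocation failed, try smaller
    }
    std::cout << "Largest allocatable contiguous block: ~"
              << lo / (1024 * 1024) << " MiB\n";
    return 0;
}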

3 Answers


The fact that your system has 16 GB doesn't mean that any program can allocate a given amount of memory at any time. In fact, this might work on a machine that has only 512 MB of physical RAM, if enough swap is available, or it might fail on an HPC node with 128 GB of RAM – it's entirely up to your operating system to decide how much memory is available to you here.

I'd also argue that std::string is never the data type of choice if actually dealing with a file, possibly binary, that large.

The point here is that there is absolutely no knowing how much memory stringstream tries to allocate. A pretty reasonable algorithm would double the allocated internal buffer every time it becomes too small to hold the incoming bytes; on top of that, libc++/libc will have their own allocators that add some allocation overhead here.

Note that stringstream::str() returns a copy of the data contained in the stringstream's internal state, again leaving you with at least 2.2 GB of heap used up for this task.
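
If all you need is the file's bytes in a single std::string, one way to avoid that second copy is to size the string up front and read into it directly. A minimal sketch, using the file name from the question (note that it still requires one contiguous ~1.1 GB allocation):

#include <fstream>
#include <string>

int main()
{
    // Open at the end so tellg() gives the file size, then rewind.
    std::ifstream file("big_file.dat", std::ios::binary | std::ios::ate);
    const std::streamsize size = file.tellg();
    file.seekg(0, std::ios::beg);

    // Read straight into the string's buffer: a single allocation instead of
    // the growing stream buffer plus a full copy.
    std::string text(static_cast<std::size_t>(size), '\0');
    file.read(&text[0], size);
    return 0;
}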

Really, if you need to deal with data from a large binary file as something that you can access with the index operator [], look into memory mapping your file; that way, you get a pointer to the beginning of the file and can work with it as if it were a plain array in memory, letting your OS take care of the underlying memory/buffer management. It's what OSes are for!

If you didn't know Boost before, it's kind of "the extended standard library for C++" by now, and of course, it has a class abstracting memory mapping a file: mapped_file.
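
A minimal sketch of what that looks like with Boost.Iostreams, assuming the file name from the question (you need to link against the boost_iostreams library):

#include <boost/iostreams/device/mapped_file.hpp>
#include <cstddef>
#include <iostream>

int main()
{
    // Map the file read-only; the OS pages the data in on demand,
    // so there is no 1.1 GB copy sitting on the heap.
    boost::iostreams::mapped_file_source file("big_file.dat");

    const char* data = file.data();   // behaves like a plain in-memory array
    std::size_t size = file.size();

    std::cout << "Mapped " << size << " bytes, first byte: " << data[0] << '\n';
    return 0;
}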

> The file I'm reading contains a series of data in ASCII tabular form, i.e. `float1,float2\nfloat3,float4\n...`.
>
> I'm browsing through the various possible solutions proposed on SO to deal with this kind of problem, but I was left wondering about this (to me) peculiar behaviour. What would you recommend in these kinds of circumstances?

Depends; I actually think the fastest way of dealing with this (since file IO is much, much slower than in-memory parsing of ASCII) is to parse the file incrementally, directly into an in-memory array of float variables, possibly taking advantage of your OS's prefetching/SMP capabilities, to the point that you don't even gain that much speed by spawning separate threads for file reading and float conversion. std::copy, used to read from std::ifstream into a std::vector<float>, should work fine here.
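
A sketch of that approach, assuming the comma-separated format from your comments: the stream has to treat `,` as whitespace for `istream_iterator<float>` to work, and the ctype-facet trick below is one (assumed, not the only) way to do that; the reserve figure is just a rough guess for a 1.1 GB file.

#include <algorithm>
#include <fstream>
#include <iterator>
#include <locale>
#include <vector>

// Classify ',' as whitespace so operator>> can extract "float1,float2\n..." directly.
struct csv_whitespace : std::ctype<char>
{
    csv_whitespace() : std::ctype<char>(make_table()) {}
    static const mask* make_table()
    {
        static std::vector<mask> table(classic_table(), classic_table() + table_size);
        table[','] |= space;
        return table.data();
    }
};

int main()
{
    std::ifstream file("big_file.dat");
    file.imbue(std::locale(file.getloc(), new csv_whitespace));

    std::vector<float> values;
    values.reserve(90000000);   // ~90 million floats expected for a 1.1 GB ASCII file

    std::copy(std::istream_iterator<float>(file),
              std::istream_iterator<float>(),
              std::back_inserter(values));
    return 0;
}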

> I'm still not getting something: you say that file IO is much slower than in-memory parsing, and this I understand (and it is the reason why I wanted to read the whole file at once). Then you say that the best way is to parse the whole file incrementally into an in-memory array of floats. What exactly do you mean by this? Doesn't this mean reading the file line by line, resulting in a large number of file IO operations?

Yes, and no: First, of course, you will have more context switches than you'd have if you just ordered the whole file to be read at once. But those aren't that expensive -- at least, they're going to be much less expensive when you realize that most OSes and libcs know quite well how to optimize reads, and will thus fetch a large part of the file at once if you don't use extremely randomized read lengths. Also, you don't incur the penalty of trying to allocate a block of RAM at least 1.1 GB in size -- that calls for some serious page table lookups, which aren't that fast either.

Now, the idea is that the occasional context switch, and the fact that (if you stay single-threaded) there will be times when you aren't reading the file because you're still busy converting text to float, will still mean less of a performance hit, because most of the time your read will return pretty much immediately, as your OS/runtime has already prefetched a significant part of your file.

Generally, to me, you seem to be worried about all the wrong kinds of things: performance seems to be important to you (is it really that important here? You're using a brain-dead file format for interchanging floats, which is bloated, loses information, and on top of that is slow to parse), yet you'd rather read the whole file in at once and only then start converting it to numbers. Frankly, if performance were of any criticality to your application, you would start to multi-thread/-process it, so that string parsing could already happen while data is still being read. Using buffers of a few kilobytes to megabytes, read up to \n boundaries and handed over to a thread that creates the in-memory table of floats, would basically reduce your read+parse time to read+non-measurable, without sacrificing read performance and without needing gigabytes of RAM just to parse a sequential file.
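
To make that concrete, here is a rough sketch of such a two-thread pipeline. The file name, the float-pair format and the "read up to a \n boundary" idea are taken from this discussion; the 1 MB chunk size and everything else are illustrative assumptions, not a tuned implementation.

#include <condition_variable>
#include <cstdlib>
#include <fstream>
#include <mutex>
#include <queue>
#include <string>
#include <thread>
#include <vector>

int main()
{
    std::queue<std::string> chunks;   // text blocks ending on a '\n' boundary
    std::mutex m;
    std::condition_variable cv;
    bool done = false;
    std::vector<float> values;

    // Consumer: converts text chunks to floats while the producer keeps reading.
    std::thread parser([&] {
        for (;;)
        {
            std::string chunk;
            {
                std::unique_lock<std::mutex> lock(m);
                cv.wait(lock, [&] { return done || !chunks.empty(); });
                if (chunks.empty())
                    return;                      // producer finished and queue drained
                chunk = std::move(chunks.front());
                chunks.pop();
            }
            const char* p = chunk.c_str();
            char* end = nullptr;
            while (*p)
            {
                float f = std::strtof(p, &end);
                if (end == p) { ++p; continue; } // skip ',' separators
                values.push_back(f);
                p = end;
            }
        }
    });

    // Producer: reads ~1 MB at a time and hands over complete lines only.
    std::ifstream file("big_file.dat", std::ios::binary);
    std::vector<char> raw(1 << 20);
    std::string carry;                           // partial line left over from the previous read
    while (file.read(raw.data(), raw.size()) || file.gcount() > 0)
    {
        carry.append(raw.data(), static_cast<std::size_t>(file.gcount()));
        std::size_t nl = carry.find_last_of('\n');
        if (nl == std::string::npos)
            continue;                            // no complete line yet
        {
            std::lock_guard<std::mutex> lock(m);
            chunks.push(carry.substr(0, nl + 1));
        }
        cv.notify_one();
        carry.erase(0, nl + 1);
    }
    {
        std::lock_guard<std::mutex> lock(m);
        if (!carry.empty())
            chunks.push(std::move(carry));       // flush a final line without trailing '\n'
        done = true;
    }
    cv.notify_one();
    parser.join();
    return 0;
}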

By the way, to give you an impression of how bad storing floats in ASCII is:

The typical 32 bit single-precision IEEE754 floating point number has about 6-9 significant decimal digits. Hence, you will need at least 6 characters to represent these in ASCII, one `.`, typically one exponential divider, e.g. `E`, on average 2.5 digits of decimal exponent, plus on average half a sign character (`-` or not), if your numbers are uniformly chosen from all possible IEEE754 32 bit floats:

-1.23456E-10

That's an average of 11 characters.

Add one , or \n after every number.

Now, your character is 1B, meaning that you blow up your 4B of actual data by a factor of 3, still losing precision.

Now, people always come around telling me that plaintext is more usable, because if in doubt, the user can read it… I've yet to see one user that can skim through 1.1GB (according to my calculations above, that's around 90 million floating point numbers, or 45 million floating point pairs) and not go insane.

Marcus Müller
  • thanks for the info. The file I'm reading contains a series of data in ASCII tabular form, i.e. `float1,float2\nfloat3,float4\n...`. I'm browsing through the various possible solutions proposed on SO to deal with this kind of problem, but I was left wondering about this (to me) peculiar behaviour. What would you recommend in these kinds of circumstances? – glS Feb 06 '16 at 11:58
  • I'm still not getting something: you say that file IO is much slower than in-memory parsing, and this I understand (and it is the reason why I wanted to read the whole file at once). Then you say that the best way is to parse the whole file incrementally into an in-memory array of floats. What exactly do you mean by this? Doesn't this mean reading the file line by line, resulting in a large number of file IO operations? Regarding Boost, I knew of it, but from what I understood it is not easy to make it work with MinGW under Windows – glS Feb 06 '16 at 12:27
  • I've used Boost with MinGW. Was like any other library. *shrugs* not really hard to use; `bootstrap.bat mingw` in the Boost source tree, then `b2 toolset=gcc`, building, and using that then was all I had to do, I think. – Marcus Müller Feb 06 '16 at 13:49
  • Of course you are right that the ASCII format is not the best suited to the purpose. This is, however, one of those circumstances in which I was not the one to decide and/or write it. Performance is an issue in my case only to the extent that I would like the processing to end within a reasonable amount of time, but I am still curious about what the most efficient way is, even if it is not crucial to the purpose at hand. I got Boost to work on my system, so I'm going to try the memory-mapping approach. – glS Feb 06 '16 at 15:07
  • To be clear: your proposed approach is then to use `ifstream` to sequentially read the values into `float`/`double` variables, right? Something like `double v1,v2; ifs >> v1; ifs.ignore(); ifs >> v2; ifs.ignore();` to get the pair of values in each row – glS Feb 06 '16 at 15:08
  • Also, unless your ASCII "float"s have more than 8 digits, use `float`, not `double`, and save a lot of space on storage :) You could also use `boost::tokenizer` to not have to `.ignore()` single characters. – Marcus Müller Feb 06 '16 at 16:45

In a 32 bit executable, the total memory address space is 4 GB. Of that, sometimes 1-2 GB is reserved for system use.

To allocate 1 GB, you need 1 GB of contiguous address space. To copy it, you need two 1 GB blocks. This can easily fail, unpredictably.

There are two approaches. First, switch to a 64 bit executable. This will not run on a 32 bit system.

Second, stop allocating 1 GB contiguous blocks. Once you start dealing with that much data, segmenting and/or streaming it starts making a lot of sense. Done right, you'll also be able to start processing it before you have finished reading it.

There are many file I/O data structures, from stxxl to Boost, or you can roll your own.
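
Rolling your own can be as simple as reading the file into fixed-size segments, so that no single contiguous gigabyte-sized block is ever required. A minimal sketch (the 64 MiB segment size is an arbitrary choice):

#include <cstddef>
#include <deque>
#include <fstream>
#include <vector>

int main()
{
    std::ifstream file("big_file.dat", std::ios::binary);
    std::deque<std::vector<char>> segments;
    const std::size_t segment_size = 64 * 1024 * 1024;   // 64 MiB per segment

    while (file)
    {
        std::vector<char> segment(segment_size);
        file.read(segment.data(), segment.size());
        segment.resize(static_cast<std::size_t>(file.gcount()));   // shrink the final, partial read
        if (!segment.empty())
            segments.push_back(std::move(segment));
    }
    return 0;
}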

Yakk - Adam Nevraumont

The size of the heap (the pool of memory used for dynamic allocations) is limited independently of the amount of RAM your machine has. You should use some other memory allocation technique for such large allocations, which will probably force you to change the way you read from the file.

If you are running on a UNIX-based system you can look at `mmap`, or at the `VirtualAlloc` function if you are running on Windows.
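
For example, on a POSIX system the file can be mapped with `mmap` instead of being copied onto the heap, which is essentially what memory-mapping wrappers such as Boost's mapped_file use under the hood. A minimal sketch with error handling kept to a bare minimum:

#include <fcntl.h>
#include <sys/mman.h>
#include <sys/stat.h>
#include <unistd.h>
#include <iostream>

int main()
{
    int fd = open("big_file.dat", O_RDONLY);
    if (fd < 0)
        return 1;

    struct stat st;
    if (fstat(fd, &st) < 0)
        return 1;

    // The kernel pages the file in on demand; no heap allocation of the full file size.
    void* mapped = mmap(nullptr, st.st_size, PROT_READ, MAP_PRIVATE, fd, 0);
    if (mapped == MAP_FAILED)
        return 1;

    const char* data = static_cast<const char*>(mapped);
    std::cout << "first byte: " << data[0] << '\n';

    munmap(mapped, st.st_size);
    close(fd);
    return 0;
}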

Daniel Lahyani