10

I would like to read a file into a string. I am looking for different ways to do it efficiently.

Using a fixed-size char* buffer

I received an answer from Tony that creates a 16 KB buffer, reads into it, and appends the buffer to the string until there is nothing more to read. I understand how it works and I found it very fast. What I don't understand is the remark in the comments on that answer that this approach copies everything twice. As I understand it, the second copy happens only in memory, not from the disk, so it is almost unnoticeable. Is it a problem that the data is copied from the buffer to the string in memory?

Using istreambuf_iterator

The other answer I received uses istreambuf_iterator. The code looks beautiful and minimal, but it is extremely slow. I don't know why that happens. Why are those iterators so slow?

Using memcpy()

For this question I received comments that I should use memcpy() as it is the fastest native method. But how can I use memcpy() with a string and an ifstream object? Isn't ifstream supposed to work with its own read function? Why does using memcpy() ruin portability? I am looking for a solution which is compatible with VS2010 as well as GCC. Why would memcpy() not work with those?

Is there any other efficient way?

What do you recommend? What shall I use for small (< 10 MB) binary files?

(I did not want to split this question into parts, as I am more interested in a comparison of the different ways to read an ifstream into a string.)

hyperknot
  • memcpy() comment refers to reading using memory-mapped file, not reading using istream. Memory-mapped file is not portable because it depends on OS API. – Kien Truong Jun 28 '11 at 17:59
  • When you're measuring performance, are you doing it in release or debug mode? Do you have optimizations turned on? Do you have iterator checking turned off? By default Visual Studio has extra-standard iterator checking that can hurt performance. – luke Jun 28 '11 at 18:12
  • possible duplicate of [how to pre-allocate memory for a std::string object](http://stackoverflow.com/questions/3303527/how-to-pre-allocate-memory-for-a-stdstring-object/3304059#3304059)? Perhaps the most exact duplicate I've seen yet. The entire first sentence is virtually identical (the sole difference being "I need to..." vs. "I would like to...") – Jerry Coffin Jun 28 '11 at 18:39

2 Answers

10

it only happens in the memory, not from the disk, so it is almost unnoticeable

That is indeed correct. Still, a solution that doesn’t do that may be faster.

Why are those iterators so slow?

The code is slow not because of the iterators themselves but because the string doesn’t know how much memory to allocate: istreambuf_iterators are input iterators and can only be traversed once, so the string cannot measure the range up front. It is essentially forced to perform repeated concatenations, and the resulting memory reallocations are very slow.

My favourite one-liner, from another answer, streams directly from the underlying buffer:

string str(static_cast<stringstream const&>(stringstream() << in.rdbuf()).str());

On recent platforms this will indeed pre-allocate the buffer. It will however still result in a redundant copy (from the stringstream to the final string).

Konrad Rudolph
  • I was just timing different solutions, and yours is about 8 times faster than all the iterator based ones. Very good one. – Björn Pollex Jun 28 '11 at 18:20
4

The most general way would probably be the response using the istreambuf_iterator:

std::string s( (std::istreambuf_iterator<char>( source )),
               (std::istreambuf_iterator<char>()) );

Although exact performance is very dependent on the implementation, it's highly unlikely that this is the fastest solution.

An interesting alternative would be:

std::ostringstream tmp;   // must be an output stream: istringstream has no operator<<
tmp << source.rdbuf();
std::string s( tmp.str() );

This could be very rapid, if the implementation has done a good job on the operator<< you're using, and in how it grows the string within the ostringstream. Some earlier implementations (and maybe some more recent ones as well) were very bad at this, however.

In general, performance using an std::string will depend on how efficient the implementation is in growing a string; the implementation cannot determine how large to make it initially. You might want to compare the first algorithm using the same code with std::vector<char> instead of std::string, or if you can make a good estimate of the maximum size, using reserve, or something like:

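// Safe only if the stream holds at most expectedSize characters; anything
// shorter leaves trailing '\0' characters in s.
std::string s( expectedSize, '\0' );
std::copy( std::istreambuf_iterator<char>( source ),
           std::istreambuf_iterator<char>(),
           s.begin() );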

memcpy cannot read from a file, and with a good compiler, will not be as fast as using std::copy (with the same data types).

I tend to use the second solution, above, with the << on the rdbuf(), but that's partially for historical reasons; I got used to doing this (using istrstream) before the STL was added to the standard library. For that matter, you might want to experiment with istrstream and a pre-allocated buffer (supposing you can find an appropriate size for the buffer).

James Kanze
  • If the source stream is seekable, you can get its size by doing `source.seekg(0,std::ios_base::end); std::streampos pos=source.tellg(); source.seekg(0,std::ios_base::beg);`. After this, if `source` is still Ok and `pos!=-1`, `pos` will be, e.g., the size of a file. I have used this in the past. – sbi Jun 28 '11 at 18:43
  • @sbi That will work on most Unix implementations, but not on Windows, at least if the file is opened in text mode. And it's not guaranteed to even compile. – James Kanze Jun 29 '11 at 07:24
  • @James: Can you elaborate? I know I used it in a cross-platform app, and I think it worked on Win32, OSX, BSD, Linux, Solaris, and some others. – sbi Jun 29 '11 at 08:22
  • @sbi For starters, `std::streampos` is an implementation defined type which is not necessarily convertible to an integral type. And even when it is convertible (it must be a class type), there is no guaranteed relationship between the numeric value of the integer and anything else---it could be a magic cookie. Finally, you don't define size, but in this case, what is wanted is the number of characters which will be read before `EOF`. And for that definition, it doesn't work under Windows unless the file is opened in binary mode. – James Kanze Jun 29 '11 at 12:56
  • @James: Thanks. `std::streampos` not being convertible or its value not conveying any meaning might indeed be a show stopper. I didn't know about that. As for what's considered the size: Is the value reported by `tellg()` not in the same way binary/text that streaming is? (However, even if it isn't, usually this will be about 10% of the file's size. It might thus result in one additional allocation rather than an arbitrary amount.) – sbi Jun 29 '11 at 13:31
  • @sbi On Windows, the value reported by `tellg()` will be the same for any given file, regardless of whether it is opened in text mode or in binary. The number of characters you can read will not be the same, however. For the purpose of determining buffer size, it might be sufficient, since the number of characters you can read will always be less than or equal to the results of `tellg()`. Typically, for text mode files, `tellg()` won't be too much bigger than what you can read, but it can be significantly different. – James Kanze Jun 29 '11 at 15:35
  • By default `std::istream_iterator` skips whitespace characters (`std::istreambuf_iterator` does not); you need to reconfigure the stream, e.g. with `std::noskipws`, to prevent this. [Demo](https://godbolt.org/z/Yb1WTb). – Marek R Dec 08 '20 at 15:14
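The `seekg`/`tellg` idea from the comments above can be sketched as follows for the binary-file case in the question. The helper name `read_file_binary` is hypothetical. The sketch assumes that the position returned by `tellg()` converts to a byte count, which common implementations provide but which, as noted above, the standard does not promise. The final `resize` to `gcount()` covers the case where fewer characters arrive than the reported size.

```cpp
#include <cstddef>
#include <fstream>
#include <ios>
#include <string>

// Hypothetical helper: sizes the string from the file length, then fills it
// with a single read(). Returns an empty string if the file cannot be
// opened or its size cannot be determined.
std::string read_file_binary(const char* path)
{
    std::ifstream in(path, std::ios::in | std::ios::binary);
    if (!in) {
        return std::string();
    }

    in.seekg(0, std::ios::end);
    // Assumption: the stream position converts to a meaningful byte offset.
    const std::streamoff size = in.tellg();
    in.seekg(0, std::ios::beg);
    if (!in || size <= 0) {
        return std::string();
    }

    std::string result(static_cast<std::size_t>(size), '\0');
    in.read(&result[0], static_cast<std::streamsize>(size));
    // Keep only what was actually read.
    result.resize(static_cast<std::size_t>(in.gcount()));
    return result;
}
```

Opening in binary mode is what makes the reported size match the number of characters read; in text mode on Windows the size would only be an upper bound.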