
Many other posts, like "Read whole ASCII file into C++ std::string", explain some of the options but do not describe the pros and cons of the various methods in any depth. I want to know why one method is preferable over another.

All of these use std::fstream to read the file into a std::string. I am unsure of the costs and benefits of each method. Let's assume this is for the common case where the files being read are known to be of some smallish size that memory can easily accommodate; clearly reading a multi-terabyte file into memory is a bad idea no matter how you do it.

The most common way, after a few Google searches, to read a whole file into an std::string involves using std::getline and appending a newline character after each line. This seems needless to me, but is there some performance or compatibility reason why it is ideal?

std::string Results;
std::string Line;
std::ifstream ResultReader("file.txt");
while(std::getline(ResultReader, Line))
{
    Results += Line;
    Results.push_back('\n'); // getline discards the '\n', so restore it
}

Another way I pieced together is to change the getline delimiter so it is something not in the file. The EOF character seems unlikely to appear in the middle of a file, so it seems a likely candidate. This includes a cast, so there is at least one reason not to do it, but it does read the file in one go with no string concatenation. Presumably there is still some cost for the delimiter checks. Are there any other good reasons not to do this?

std::string Results;
std::ifstream ResultReader("file.txt");
std::getline(ResultReader, Results, (char)std::char_traits<char>::eof());

The cast means that systems defining std::char_traits<char>::eof() as something other than -1 might have problems. Is this a practical reason not to choose this over other methods that use std::getline and string::push_back('\n')?
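
One way to guard that assumption instead of relying on it silently is a compile-time check; std::char_traits<char>::eof() is constexpr since C++11, so it can feed a static_assert. Note also that (char)-1 is the byte 0xFF, which can legitimately occur in Latin-1 text ('ÿ') or binary data, so the delimiter trick can stop early on such files even when eof() is -1. A minimal sketch (my suggestion, not from the question):

#include <string>

// Fail the build if the eof() == -1 assumption the delimiter trick relies on
// does not hold on this implementation.
static_assert(std::char_traits<char>::eof() == -1,
              "getline-with-eof-delimiter trick assumes eof() == -1");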

How do these compare to other ways of reading the file at once, like in this question: Read whole ASCII file into C++ std::string

std::ifstream ResultReader("file.txt");
std::string Results((std::istreambuf_iterator<char>(ResultReader)),
                     std::istreambuf_iterator<char>());

It would seem this would be best. It offloads almost all the work onto the standard library, which ought to be heavily optimized for the given platform. I see no reason for checks other than stream validity and the end of the file. Is this ideal, or are there problems with it that I am not seeing?
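
A refinement worth benchmarking alongside it (a sketch of my own variant, not from the linked question): seek to the end first and use the byte count as a capacity hint, so the iterator copy does not have to grow the string repeatedly.

#include <fstream>
#include <iterator>
#include <string>

std::ifstream ResultReader("file.txt", std::ios::binary);
ResultReader.seekg(0, std::ios::end);
std::string Results;
Results.reserve(static_cast<std::size_t>(ResultReader.tellg())); // capacity hint
ResultReader.seekg(0, std::ios::beg);
Results.assign(std::istreambuf_iterator<char>(ResultReader),
               std::istreambuf_iterator<char>());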

Does the standard or do details of some implementation provide reasons to prefer some method over another? Have I missed some method that might prove ideal in a wide variety of circumstances?

What is the simplest, most idiomatic, best performing, and standard compliant way of reading a whole file into an std::string?

EDIT 2 - This question has prompted me to write a small suite of benchmarks. They are MIT licensed and available on GitHub at: https://github.com/Sqeaky/CppFileToStringExperiments

Fastest - TellSeekRead and CTellSeekRead - These have the system provide an easy way to get the size and read the file in one go.

Faster - Getline Appending and Eof - The checking of characters does not seem to impose any cost.

Fast - RdbufMove and Rdbuf - The std::move seems to make no difference in release builds.

Slow - Iterator, BackInsertIterator and AssignIterator - Something is wrong with iterators and input streams. They work great in memory, but not here. That said, some of these are faster than others; a sketch of the back-inserter variant follows.
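
For reference, a minimal sketch of what the BackInsertIterator variant looks like (my reading of the benchmark name; the repository linked above is authoritative): copying the stream into the string through std::back_inserter, which grows the string one insertion at a time and is the likely source of the slowdown.

#include <algorithm>
#include <fstream>
#include <iterator>
#include <string>

std::string ReadWithBackInserter(const char* Path)
{
    std::ifstream In(Path, std::ios::binary);
    std::string Result;
    std::copy(std::istreambuf_iterator<char>(In),
              std::istreambuf_iterator<char>(),
              std::back_inserter(Result)); // one push_back per character
    return Result;
}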

I have added every method suggested so far, including those in links. I would appreciate it if someone could run this on Windows and with other compilers. I currently do not have access to a machine with NTFS, and it has been noted that this and compiler details could be important.

As for measuring simplicity and idiomatic-ness, how do we measure these objectively? Simplicity seems doable, perhaps using something like LOC and cyclomatic complexity, but how idiomatic something is seems purely subjective.

Sqeaky
  • possible duplicate of [Read whole ASCII file into C++ std::string](http://stackoverflow.com/questions/2602013/read-whole-ascii-file-into-c-stdstring) – Chris Drew Aug 23 '15 at 18:38
  • The linked answer uses seek/tell to find the length of the file. If you know it is a regular file, it is simpler to use stat. – stark Aug 23 '15 at 18:45
  • `stat` is standard compliant, but the standard is POSIX. – user4581301 Aug 23 '15 at 20:22
  • @user4581301 That is an odd quirk of how I worded my question; however, there are still non-POSIX platforms, and at least one is still popular. I meant the standards of the C and C++ programming languages. – Sqeaky Aug 23 '15 at 21:24
  • @ChrisDrew I actually read that whole post before writing this one. I was aware it and many posts like it existed. None of the answers or even the comments draw much of a distinction between the methods of reading a file. There are a few comments, but nothing citing a standard and little explaining why one method might be faster or more compatible than others. – Sqeaky Aug 23 '15 at 21:27
  • What I suspected. I was qualifying stark's comment. – user4581301 Aug 23 '15 at 21:30
  • I should have @replied to both of you; I wasn't trying to pick on anyone. Even then, `stat` is a viable answer for many. – Sqeaky Aug 23 '15 at 21:37
  • No worries. If you can @ multiple people, that feature's shown up in the few months since I got here. – user4581301 Aug 23 '15 at 21:50

3 Answers


What is the simplest, most idiomatic, best performing, and standard compliant way of reading a whole file into an std::string?

Those are pretty much contradictory requests; one is likely to come at the expense of another. Simpler code won't be the fastest, or the most idiomatic.

After exploring this area for a while I've come to some conclusions:
1) The biggest performance penalty is the IO action itself - the fewer IO actions taken, the faster the code.
2) Memory allocations are also quite expensive, but not as expensive as the IO.
3) Reading as binary is faster than reading as text.
4) Using the OS API will probably be faster than C++ streams.
5) std::ios_base::sync_with_stdio doesn't really affect the performance; it's an urban legend.

Using std::getline is probably not the best choice if performance is needed, because it will make N IO actions and N allocations for N lines.

A compromise which is fast, standard and elegant is to get the file size, allocate all the memory at once, then read the file in one go:

std::ifstream fileReader(<your path here>, std::ios::binary | std::ios::ate);
if (fileReader)
{
    auto fileSize = fileReader.tellg();   // opened with ios::ate, so this is the size
    fileReader.seekg(0, std::ios::beg);   // rewind to the beginning
    std::string content(static_cast<std::size_t>(fileSize), '\0');
    fileReader.read(&content[0], fileSize);
}

Move the content around to prevent unneeded copies.
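
For example, a sketch of one way to act on that (the wrapper function and its name are mine): return the string by value so move semantics or copy elision hand the buffer out without a deep copy, which also answers the scope concern raised in the comments below.

#include <fstream>
#include <string>

std::string readFile(const std::string& path)
{
    std::ifstream fileReader(path, std::ios::binary | std::ios::ate);
    std::string content;
    if (fileReader)
    {
        auto fileSize = fileReader.tellg();
        fileReader.seekg(0, std::ios::beg);
        content.resize(static_cast<std::size_t>(fileSize));
        fileReader.read(&content[0], static_cast<std::streamsize>(fileSize));
    }
    return content; // moved or elided, never deep-copied
}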

David Haim
  • I added this to the benchmark suite I linked in the question. I agree this method is good and the fastest so far, but I disagree with some of your points. I do not think binary is faster than text; I saw no difference in terms of milliseconds over 1000 iterations. I think the answer to this whole question may be as simple as your point 1. – Sqeaky Aug 24 '15 at 20:34
  • The `std::string(size_t, char)` constructor not only allocates and sets the size, but also fills the allocated memory with the given char. I would use `std::unique_ptr<char[]>(new char[fileSize])` or maybe `make_unique` - that way you will have exception safety and also avoid initializing the potentially large buffer with `'\0'` – Roman Kruglov Jul 03 '17 at 15:41
  • Defining `content` inside the block will destroy it at block end, so any code using it needs to be in that block, right? – rwst Apr 18 '19 at 06:54

This website has a good comparison of several different methods for doing that. The one I currently use is:

std::string read_sequence() {
    std::ifstream f("sequence.fasta");
    std::ostringstream ss;
    ss << f.rdbuf();
    return ss.str();
}

If your text file's lines are separated by newlines, this will keep them. If you want to remove them instead (which is my case most of the time), you can just add a call to something such as

auto s = ss.str();
s.erase(std::remove_if(s.begin(), s.end(),           // needs <algorithm>
        [](char c) { return c == '\n'; }), s.end());
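
As an aside (my addition, not part of the original answer): since C++20 the same erase-remove step is a single call, declared in <string>:

std::erase(s, '\n'); // C++20 one-liner, equivalent to the erase/remove_if above
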
LLLL
  • I will read the website you linked; thank you for the lambda remove_if expression, that is a simple way to achieve such tasks. Your read-buffer-to-stringstream method does not seem materially different from Max's method; the std::move does not seem to do anything a good compiler doesn't already do. I added RdbufMove as a test to the benchmark suite this question is making me write: https://github.com/Sqeaky/CppFileToStringExperiments – Sqeaky Aug 24 '15 at 19:53
  • Use mmap and subclass string to behave properly. Windows [appears to have a similar](https://msdn.microsoft.com/en-us/library/aa366542(v=vs.85).aspx) facility. – msw Aug 25 '15 at 01:37
  • @msw I have no clue what you said means, nor do I have access to a windows machine. Could you please explain? – Sqeaky Aug 25 '15 at 07:43
  • @Sqeaky you are right, the std::move there is unnecessary. Thanks for pointing that out :-) – LLLL Aug 26 '15 at 22:09

There are two big difficulties with your question. First, the Standard doesn't mandate any particular implementation (yes, nearly everybody started with the same implementation; but they've been modifying it over time, and the optimal I/O code for NTFS, say, will be different than the optimal I/O code for ext4), so it is possible (although somewhat unlikely) for a particular approach to be fastest on one platform, but not another. Second, there's a little difficulty in defining "optimal"; I assume you mean "fastest," but that's not necessarily the case.

There are approaches that are idiomatic, and perfectly fine C++, but unlikely to give wonderful performance. If your goal is to end up with a single std::string, using std::getline(std::istream&, std::string&) is very likely to be slower than necessary. The std::getline() call has to look for the '\n', and you'll occasionally reallocate and copy the destination std::string. Even so, it's ridiculously simple and easy to understand. That could be optimal from a maintenance perspective, assuming you don't need the absolute fastest performance possible. It will also be a good approach if you don't need the whole file in one giant std::string at one time. You'll be very frugal with memory.

An approach that is likely more efficient is to manipulate the read buffer:

std::string read_the_whole_file(std::istream& istr)
{
    std::ostringstream sstr;
    sstr << istr.rdbuf(); // one bulk copy through the stream buffer
    return sstr.str();
}

Personally, I'm just as likely to use std::fopen() and std::fread() (and std::unique_ptr<FILE>) because, on Windows at least, you'll get a better error message when std::fopen() fails than when constructing a file stream object fails. I consider the better error message an important factor when deciding which approach is optimal.
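
A minimal sketch of that fopen/fread approach (my illustration, not Max's exact code; note that a bare std::unique_ptr<FILE> needs a deleter to actually call std::fclose):

#include <cstdio>
#include <memory>
#include <string>

std::string read_with_fread(const char* path)
{
    std::unique_ptr<std::FILE, int (*)(std::FILE*)> file(std::fopen(path, "rb"),
                                                         &std::fclose);
    if (!file)
        return {};                        // real code would inspect errno here
    std::fseek(file.get(), 0, SEEK_END);  // assumes a seekable, regular file
    const long size = std::ftell(file.get());
    std::fseek(file.get(), 0, SEEK_SET);
    std::string content(static_cast<std::size_t>(size), '\0');
    std::fread(&content[0], 1, content.size(), file.get());
    return content;
}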

Max Lybbert
  • I wrote this and the 3 methods I wrote about into a microbenchmark: https://github.com/Sqeaky/CppFileToStringExperiments . Do you have ready access to a machine with NTFS? I do not. Somehow the two naive getline strategies were the fastest, then the direct access to the read buffer was marginally but measurably slower, and finally the iterator method was terribly slow. I agree the error message is important, but its quality is hard to measure empirically. – Sqeaky Aug 24 '15 at 08:47