
A number of Stack Overflow answers deal with how to slurp in files from disk, where you can preallocate memory based on the file size.

But what is the fastest way to slurp in stdin (e.g. when a large file is piped into your program)?

I am happy to slurp into a vector (which can always be converted into a std::string later) if that is the fastest solution.

Leo Goodstadt

2 Answers


The fastest way to read unformatted data into memory is to use an unformatted read routine such as fstream::read(). Nothing will beat it.

BEWARE! Some folks claim you will see a performance improvement by using OS-level routines, like read(). You will get a tremendous performance degradation if you try this.

EDIT. Some explanation of the above statement. The reason for the degradation is the kernel calls: every read() is a kernel call, so unless you read in exactly the optimal buffer size, you will incur either extra kernel calls or sub-optimal reads. You could experimentally figure out the best read size, but the C runtime has already done this for you. fread() and unformatted stream reads are already optimized, so no matter how big your reading chunks are, you are guaranteed to call the kernel in the most optimal way.
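
For illustration, here is a minimal sketch of what this answer describes: large buffered reads through the C runtime, in this case using fread() on stdin. The function name and the 64 KiB chunk size are arbitrary choices for the sake of the example, not benchmarked values.

#include <cstdio>
#include <vector>

// Sketch: slurp all of stdin through the buffered C runtime.
// fread() batches the underlying kernel calls into optimally sized reads.
std::vector<char> slurp_stdin_fread()
{
    std::vector<char> data;
    std::vector<char> chunk(65536);   // 64 KiB per fread() call
    std::size_t n;
    while ((n = std::fread(chunk.data(), 1, chunk.size(), stdin)) > 0)
        data.insert(data.end(), chunk.data(), chunk.data() + n);
    return data;
}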

SergeyA
  • Of course, not all unformatted read routines are created equal. – Ben Voigt Oct 05 '15 at 17:39
  • Do you have anything to back up your last statement? – NathanOliver Oct 05 '15 at 17:43
  • @NathanOliver, yes. Months of testing, as well as common sense. It is pretty obvious actually. – SergeyA Oct 05 '15 at 17:44
  • @SergeyA How can it be obvious that a system call, which is low level, is slower than a function which will end up using that system call anyway? – lilezek Oct 05 '15 at 17:44
  • @SergeyA: Funny, because my testing, reproduced by dozens of other users, found that C++ iostreams are a bottleneck: http://stackoverflow.com/q/4340396/103167 Unfortunately the benchmark code got deleted by ideone.com, so you can't benchmark the exact same code today, but it was confirmed by many many people at the time, and it isn't hard to write your own test. – Ben Voigt Oct 05 '15 at 17:47
  • @BenVoigt, it is formatted input which slows you down. Unformatted input is as fast as it gets. – SergeyA Oct 05 '15 at 17:51
  • @lilezek, because of buffered IO vs unbuffered IO. – SergeyA Oct 05 '15 at 17:52
  • @SergeyA As you acknowledged in the other answer, both `std::istreambuf_iterator ` and `istream::read()` are unformatted but the former is several times slower. So clearly something else is going on. I guess the number of function calls which can't be optimised away... – Leo Goodstadt Oct 06 '15 at 14:17
  • @BenVoigt Were you testing the same thing? I.e. reading / writing a big chunk of data in one operation? Unfortunately, all your code has vanished into the ether so I can only guess what you were doing from your answer. It looks like you ran into formatted output overhead + iterator overhead. – Leo Goodstadt Oct 06 '15 at 14:20
  • @LeoGoodstadt, of course. Every call to iterator is a call to read. You are reading in increments of 1 byte. – SergeyA Oct 06 '15 at 14:25
  • @LeoGoodstadt: No formatted output whatsoever, I was calling `ostream::write()` repeatedly. It was mainly buffer testing (which is exactly what Sergey incorrectly claimed was the performance advantage of iostreams) Using large chunks is in fact the way to minimize the cost of iostreams -- but then there's absolutely no reason to use them. Contrary to what Sergey thinks, OS routines also do readahead buffering, and do so far more efficiently than the compiler-bundled libraries -- disk cache is implemented in the OS and a lot of work goes into tuning it. – Ben Voigt Oct 06 '15 at 14:25
  • @BenVoigt I agree, up to a point... if you ignore parallel i/o. We deal with a lot of embarrassingly parallel problems. For our cluster-based file systems, the throughput is massive (writing from thousands of nodes; latency is obviously not so good). However, we also have some Supermicro servers with lots of shared memory (512 GB and up) and 48 cores. The disk arrays with top-of-the-line caching RAID controllers don't cope with > 6 processes hitting i/o hard. Contention kills performance for everyone. It is also always nicer to use portable C++ if possible than low-level OS-specific calls. – Leo Goodstadt Oct 06 '15 at 17:55
  • @BenVoigt, you misunderstand my point. Of course, there is a readahead and caching and whatnot. Still, every read() is a kernel call. It is a context switch. Remember, kernel does not multitask, so there can only be one read() call for the whole system! I am surprised you do not know this. – SergeyA Oct 06 '15 at 18:03
  • @Leo: "better to use portable C++" -- yes, that's why my recommendation is `fread` in large blocks, unless I/O is your bottleneck. (You may ask, if I/O isn't the bottleneck, does it matter at all? And I say, "Yes!" Wasting processor cycles causes heat and runs down battery.) And parallelism is another good reason to use the OS calls... neither iostreams nor stdio provide any mechanism for asynchronous, overlapped, multirequest, or scatter/gather. RAID controllers should help with contention, by doing elevator sorting. Solid state helps even more. – Ben Voigt Oct 06 '15 at 18:21
  • @SergeyA: "Kernel does not multitask"??? Of course it does, what decade are you living in? "there can only be one read() call for the whole system" Complete baloney. – Ben Voigt Oct 06 '15 at 18:22
  • Show me how you can call read() from two threads in your application and have it executed at the same time, and then you will earn the privilege of calling me 'baloney'. – SergeyA Oct 06 '15 at 18:26
  • @SergeyA: I don't even need two threads. Just call `io_submit` with `nr` greater than one. – Ben Voigt Oct 06 '15 at 18:29
  • @SergeyA: The claim in your answer is about "OS-level routines". That means `read()`, `aio_read()`, `io_submit()`, `recvfrom()`, and many more. And yes, `read()` does need multiple threads, and they will make progress simultaneously (one thread goes into I/O wait and the other thread issues its call, then they are both outstanding on the PCI bus or whatever). But if you care about I/O efficiency, use `io_submit` (Linux) or OVERLAPPED I/O and maybe completion ports (Win32). – Ben Voigt Oct 06 '15 at 18:32
  • @SergeyA: More to the point, if two threads issue `read()` and both requests are satisfied from disk cache, or from a pipe buffer, you don't think both memcpy-like operations won't take place simultaneously on multiple cores? – Ben Voigt Oct 06 '15 at 18:34
  • @BenVoigt, I see no reason to continue this discussion further. Just 2 months ago I improved the performance of a block of code which did a lot of file reading by removing read() calls and replacing them with fstream::read(). The performance was about 3 times better, from 30 minutes to 10 minutes for reading around 2.5 GB of data. Everyone is entitled to their own opinion, of course. – SergeyA Oct 06 '15 at 18:35
  • @SergeyA: 10 minutes for 2.5 GB of data? That should have taken 30 seconds, tops. 5 seconds on an SSD. – Ben Voigt Oct 06 '15 at 18:35
  • @BenVoigt, the reading was done over NFS :) I should've mentioned this. – SergeyA Oct 06 '15 at 18:36
  • @SergeyA: If you're saying that making piecemeal calls to OS functions harms performance, then I agree. So does making piecemeal calls to iostreams, because iostreams buffer management is horrible (approximately one hundred to one thousand times as expensive as necessary, no joke). The hierarchy is something like *(best)* OS block transfers > stdio block transfers > stdio piecemeal = iostreams block transfers (disable sync stdio) >> stdio formatted input >>> iostreams piecemeal = iostreams block transfers (enabled sync stdio) > iostreams formatted input >>> OS piecemeal *(worst)* – Ben Voigt Oct 06 '15 at 22:28

Reading into a fixed-sized buffer in a loop

To my surprise, old-fashioned, almost C-like code seems to be the fastest with both clang and gcc:

// assumes the usual headers (<iostream>, <vector>) and "using namespace std;"
{
    vector<char> cin_str;
    // a 64 KiB read buffer seems sufficient
    std::streamsize buffer_sz = 65536;
    vector<char> buffer(buffer_sz);
    cin_str.reserve(buffer_sz);

    // sgetn() pulls bytes straight out of the stream buffer, with no formatting layer
    auto rdbuf = cin.rdbuf();
    while (auto cnt_char = rdbuf->sgetn(buffer.data(), buffer_sz))
        cin_str.insert(cin_str.end(), buffer.data(), buffer.data() + cnt_char);
}

Using istream::read() and istream::gcount() was just as fast but required a little extra code (a sketch follows).
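
For reference, a rough sketch of that variant, in the same style as the snippet above (the buffer size mirrors the code above; treat it as illustrative rather than benchmarked):

{
    vector<char> cin_str;
    // same 64 KiB read buffer
    std::streamsize buffer_sz = 65536;
    vector<char> buffer(buffer_sz);
    cin_str.reserve(buffer_sz);

    for (;;)
    {
        cin.read(buffer.data(), buffer_sz);
        std::streamsize cnt_char = cin.gcount();   // bytes actually read; < buffer_sz on the last chunk
        if (cnt_char <= 0)
            break;
        cin_str.insert(cin_str.end(), buffer.data(), buffer.data() + cnt_char);
    }
    return cin_str;
}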

C++ iterators

Surprisingly, using istreambuf_iterator (iterator for unformatted input) turned out to be much, much slower: >3x for some test files, even after switching off sync with stdio.

{
    std::ios_base::sync_with_stdio(false);
    vector<char> cin_str;
    // reserve 64 KiB up front
    std::streamsize buffer_sz = 65536;
    cin_str.reserve(buffer_sz);

    std::istreambuf_iterator<char> iit(std::cin.rdbuf()); // stdin iterator
    std::istreambuf_iterator<char> eos;                   // end-of-range iterator
    std::copy(iit, eos, std::back_inserter(cin_str));
    return cin_str;
}

This is true even after reserving space for the vector buffer (rather than just assigning to it).

The other surprise is that I see (near) maximum speed even with a very modest buffer size (64 KiB). vector just has a very efficient reallocation strategy.

Addendum:

Googling finds this blog post (http://insanecoding.blogspot.in/2011/11/reading-in-entire-file-at-once-in-c.html) from 2011, which seems to show that this approach is about as fast as you can go in C++ (with gcc/clang), and that switching to cstdio does not provide further gains (but obviously makes the code even uglier!).

Avoiding copies

@BenVoigt points out that sgetn() / istream::read() can place the data directly into its final destination, avoiding the intermediate buffer, if we judiciously preallocate the requisite space:

{
    std::ios_base::sync_with_stdio(false);
    // read in 64 KiB chunks
    std::streamsize buffer_sz = 65536;
    vector<char> cin_str(buffer_sz);
    std::streamsize cin_str_data_end = 0;

    // read directly into the destination vector, keeping it one chunk larger than the data
    auto rdbuf = cin.rdbuf();
    while (auto cnt_char = rdbuf->sgetn(cin_str.data() + cin_str_data_end, buffer_sz))
    {
        cin_str_data_end += cnt_char;
        cin_str.resize(cin_str_data_end + buffer_sz);
    }
    cin_str.resize(cin_str_data_end);   // trim back to the bytes actually read
    return cin_str;
}

In testing, this resulted in no further speedup, probably because the code is dominated by (1) I/O, (2) system-call overhead, and (3) vector memory allocation.

Is there a faster way to do this? Memory-mapped files from Boost?
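
(As the comments below point out, memory mapping can only work when stdin has been redirected from a regular file; a pipe or tty cannot be mapped. Purely for illustration, a POSIX-only sketch of that special case, with made-up names and no Boost, might look like this.)

#include <sys/mman.h>
#include <sys/stat.h>
#include <unistd.h>
#include <string>

// Sketch: map stdin only when fd 0 refers to a regular, non-empty file.
// A real program would fall back to one of the read loops above otherwise.
std::string slurp_stdin_mmap()
{
    struct stat st;
    if (fstat(STDIN_FILENO, &st) != 0 || !S_ISREG(st.st_mode) || st.st_size == 0)
        return {};   // pipe, tty, or empty file: cannot (usefully) map

    void* p = mmap(nullptr, st.st_size, PROT_READ, MAP_PRIVATE, STDIN_FILENO, 0);
    if (p == MAP_FAILED)
        return {};

    // copy out for simplicity; the mapping could also be used in place
    std::string data(static_cast<const char*>(p), st.st_size);
    munmap(p, st.st_size);
    return data;
}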

Leo Goodstadt
  • Do you have data to back this up? What compilation settings did you use? Also why does your answer end with a question? – NathanOliver Oct 05 '15 at 17:36
  • You can't memory-map a pipe or tty, so very unlikely to work with `stdin`. You could potentially play virtual memory tricks to extend the buffer by committing additional pages in a large reserved range, thus avoiding copying data during resize of buffer. And slowness of `istreambuf` isn't surprising to anyone who has ever profiled C++ I/O, the poor performance of iostreams is legendary. You'd probably get a speed improvement using `fread` or even the OS-specific `read` function (on Windows, `ReadFile`). – Ben Voigt Oct 05 '15 at 17:36
  • Why are you surprised? iostream iterators do formatted input, of course they will be slower. – SergeyA Oct 05 '15 at 17:36
  • Did you forget to turn off stdin sync? – Lightness Races in Orbit Oct 05 '15 at 17:38
  • stdin is not memory-mappable, of course. It does not have a name. – SergeyA Oct 05 '15 at 17:39
  • @BenVoigt WRONG WRONG WRONG! – SergeyA Oct 05 '15 at 17:40
  • @SergeyA: `stdin` is just file descriptor `0`. It is memory-mappable if file descriptor `0` has a normal file open, but not for pipe or tty. – Ben Voigt Oct 05 '15 at 17:40
  • @SergeyA: Your comment would be more useful if it said what you think is wrong... – Ben Voigt Oct 05 '15 at 17:41
  • @BenVoigt, using OS-level routines (i.e. `read`) will lead to a terrible performance degradation. – SergeyA Oct 05 '15 at 17:42
  • @BenVoigt, I take my comment in its generality. Yeah, if you are reading your stdin from a file because it is redirected, you can mmap it. But usually it is console input, this is why it is called stdin, and this is not mappable. – SergeyA Oct 05 '15 at 17:43
  • @SergeyA: That's why my comment starts off "you can't memory-map a pipe or tty".... and my profiler data shows that C++ iostreams (both Visual C++ and g++ implementations) are slower than my disk. And I got a very real speed improvement switching to stdio and then a smaller one switching to OS routines. – Ben Voigt Oct 05 '15 at 17:46
  • Agree with @LightnessRacesinOrbit; when you're working with stdin/stdout/stderr, failing to set [`ios_base::sync_with_stdio(false)`](http://www.cplusplus.com/reference/ios/ios_base/sync_with_stdio/) will kill the relative performance of C++ `cin`/`cout`/`cerr`. – ShadowRanger Oct 05 '15 at 17:48
  • @SergeyA. Why do you think that `std::istreambuf_iterator` does formatted input? I am guessing that the slow down comes instead from character caching / repeated system function calls. Any ideas? – Leo Goodstadt Oct 06 '15 at 11:04
  • @LightnessRacesinOrbit @ShadowRanger Not setting `ios_base::sync_with_stdio(false)` does indeed slow down reading from stdin. But `istreambuf_iterator` is much slower even without syncing. On my system, `cat`-ing a 400 MB file into stdin: `1s` for `sgetn()` / `istream::read()`; `3.6s` for `std::istreambuf_iterator`; `19s` for `std::istreambuf_iterator` without `ios_base::sync_with_stdio(false)`. Clearly something else is going on. – Leo Goodstadt Oct 06 '15 at 11:06
  • @BenVoigt Thank Goodness that using fread() does not provide further speedups on my system. Are there any lower level insane i/o techniques that c++ competition guys use? My test harness is very yucky or I would just post it. – Leo Goodstadt Oct 06 '15 at 11:17
  • @LeoGoodstadt, sorry, I misread it. I thought you'd used a stream iterator. Well, streambuf iterators I never used. – SergeyA Oct 06 '15 at 14:00
  • @Leo: BTW, your call to `insert` is an extra unnecessary copy. If you resize your `cin_str` vector, you can read directly into it. You could get rid of the `buffer` vector and just keep track of how many valid bytes are in `cin_str`, then truncate at the end. – Ben Voigt Oct 06 '15 at 18:27
  • @BenVoigt Thanks. I have included your suggestion in the code. I find, however, that it makes no difference to performance. This is unsurprising as this sort of copy / extend code tends to be dominated by the high cost of memory allocation (see various talks from Alexandrescu). The resulting code is also (to my mind) much more fragile and liable to having bugs introduced at review later! – Leo Goodstadt Oct 07 '15 at 12:37