5

I’m writing a C++14 program to load text strings from a file, do some computation on them, and write the results back to another file. I’m using Linux, and the files are relatively large (on the order of 10^6 lines). My typical approach is to use the old C getline and sscanf utilities to read and parse the input, and fprintf(FILE*, …) to write the output files. This works, but I’m wondering whether there’s a better way, with the goals of high performance and an approach that’s generally recommended for the modern C++ standard I’m using. I’ve heard that iostream is quite slow; if that’s true, what would be the recommended alternative?
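
For context, a minimal sketch of the C-style loop described above (the file names and the sscanf format are placeholders, not the real program):

// Baseline: POSIX getline + sscanf + fprintf, as described in the question.
#include <cstddef>
#include <cstdio>
#include <cstdlib>

int main() {
    std::FILE* in  = std::fopen("input.txt", "r");
    std::FILE* out = std::fopen("output.txt", "w");
    if (!in || !out) return 1;

    char* line = nullptr;
    std::size_t cap = 0;
    while (getline(&line, &cap, in) != -1) {      // POSIX getline, grows `line` as needed
        int  id = 0;
        char text[256] = {};
        if (std::sscanf(line, "%d %255s", &id, text) == 2) {  // placeholder format
            std::fprintf(out, "%d %s\n", id, text);
        }
    }
    std::free(line);
    std::fclose(in);
    std::fclose(out);
}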

Update: To clarify the use case a bit: for each line of the input file, I'll be doing some text manipulation (data cleanup, etc.). Each line is independent. So loading the entire input file (or at least large chunks of it), processing it line by line, and then writing it out seems to make the most sense. The ideal abstraction for this would be an iterator over the read-in buffer, with each line being an entry. Is there a recommended way to do that with std::ifstream?

Kulluk007
  • 722
  • 1
  • 7
  • 16
  • 1
    Depending on the way you access your input file (sequential or not), using a file mapping API might get you some benefits. Not sure what it's called in the Linux world; what I refer to on Windows platforms is called `CreateFileMapping()` and its family of functions. – BitTickler Jul 28 '16 at 20:28
  • 1
    It depends on what you want to do with the lines that you read (keep a copy in memory or not) and what kind of parsing you're doing. It would be worth showing some snippets. Also, this [answer to another question](http://stackoverflow.com/a/33444050/3723423) could interest you; it doesn't address the scanning, but it addresses some other performance aspects, with a link to some benchmarking code. – Christophe Jul 28 '16 at 20:40
  • 2
    If you have the memory to, read the entire file into a buffer in 1 read, process it in memory, and write it all out again in 1 write. If you don't, use a [memory mapped file](https://en.wikipedia.org/wiki/Memory-mapped_file) (a Linux mmap sketch follows these comments). – David Jul 28 '16 at 20:46
  • I recommend not optimizing until you try the simple way (scanf) and decide it is too slow. – brian beuning Jul 28 '16 at 21:08
  • 1
    @brian: sure, premature optimization is the root of all evil, etc. etc. But ISTM that in this case it is not premature, since there are apparently performance problems. – Rudy Velthuis Jul 28 '16 at 21:11
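
For reference, a hedged sketch of the memory-mapped route mentioned in these comments, using POSIX open/fstat/mmap on Linux; the file name and the line-splitting loop are illustrative, not from the comments:

#include <fcntl.h>
#include <sys/mman.h>
#include <sys/stat.h>
#include <unistd.h>
#include <cstring>
#include <string>
#include <vector>

int main() {
    const int fd = ::open("file.txt", O_RDONLY);
    if (fd < 0) return 1;

    struct stat st;
    // Bail out on error or an empty file (a zero-length mapping is not valid).
    if (::fstat(fd, &st) != 0 || st.st_size == 0) { ::close(fd); return 1; }

    void* mapped = ::mmap(nullptr, st.st_size, PROT_READ, MAP_PRIVATE, fd, 0);
    if (mapped == MAP_FAILED) { ::close(fd); return 1; }
    const char* p   = static_cast<const char*>(mapped);
    const char* end = p + st.st_size;

    std::vector<std::string> lines;  // copy out only the lines you need to keep
    while (p < end) {
        const char* nl = static_cast<const char*>(std::memchr(p, '\n', end - p));
        if (!nl) nl = end;
        lines.emplace_back(p, nl);   // one line, without the trailing '\n'
        p = nl + 1;
    }

    ::munmap(mapped, st.st_size);
    ::close(fd);
}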

3 Answers

12

The fastest option, if you have the memory to do it, is to read the entire file into a buffer with 1 read, process the buffer in memory, and write it all out again with 1 write.

Read it all:

#include <fstream>
#include <string>

std::string buffer;

std::ifstream f("file.txt", std::ios::binary); // binary so tellg() matches the bytes read() will see
f.seekg(0, std::ios::end);
buffer.resize(f.tellg());
f.seekg(0);
f.read(&buffer[0], buffer.size()); // &buffer[0]: the non-const data() overload only exists since C++17

Then process it

Then write it all:

std::ofstream out("out.txt", std::ios::binary); // write to a separate file so the input isn't clobbered
out.write(buffer.data(), buffer.size());
David
  • 25,830
  • 16
  • 80
  • 130
  • I expect this is faster than file mapped IO. All that does is replace fread() with page faults. – brian beuning Jul 28 '16 at 21:21
  • 2
    If the file is sufficiently large, and memory pressure for physical memory high, this code winds up reading the file twice from disk. – IInspectable Jul 28 '16 at 23:03
  • 1
    I consider it a *brave* answer without knowing the absolute size of the file and the available memory... – Peter VARGA Jul 29 '16 at 03:35
  • @brianbeuning, thank you. If I then wanted to iterate through `buffer.data()`, line by line, would you recommend `getline`? (A sketch of this appears after these comments.) – Kulluk007 Jul 29 '16 at 16:05
  • 1
    @brianbeuning and @Al Bundy, also, could there be some benefit to reading the file in in chunks (not `buffer.size()` but, say, 10000 lines at a time) and iterating through each chunk? – Kulluk007 Jul 29 '16 at 16:06
  • 1
    @Kulluk007 Firstly you can't read X lines at a time, you can read X bytes at a time. You don't know where the newlines are in a chunk of data before you've read it. Reading chunks has to handle the fact that the chunks you read won't start/end on line boundaries - some added complexity. Secondly there's no speed benefit to that unless the file is so big you _must_ do that. And finally, read about memory mapped files if you can't read it all in 1 go. Depending on how you want to read the file, they are probably faster than manual chunk-reading. – David Jul 29 '16 at 16:17
  • 2
    Sorry @David this is all fine, but is not portable: you have no guarantee that the `streampos` returned by `tellg()` will fit into the `string::size_type` expected by the `resize()` - [Demo](https://ideone.com/4KTqWt) – Christophe Aug 02 '16 at 21:25
  • 2
    @Christophe feel free to add a numeric_limits check – David Aug 02 '16 at 21:38
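
For the getline question above, a hedged sketch of iterating the in-memory buffer line by line with std::getline over an std::istringstream. Note that constructing the stringstream copies the buffer, so for very large files a manual scan for '\n' (as in the mmap sketch earlier) avoids that extra copy:

#include <sstream>
#include <string>

// Illustrative only: walk the in-memory `buffer` one line at a time.
void process_lines(const std::string& buffer) {
    std::istringstream in(buffer);   // note: this makes a copy of the buffer
    std::string line;
    while (std::getline(in, line)) {
        // per-line cleanup / text manipulation goes here
    }
}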
3

If you have C++17 (std::filesystem), there is also this way (which gets the file's size through std::filesystem::file_size instead of seekg and tellg). I presume this would allow you to avoid reading twice

It's shown in this answer
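
A minimal sketch of that C++17 variant (the helper name read_whole_file is just for illustration):

#include <cstddef>
#include <filesystem>
#include <fstream>
#include <string>

// Size the buffer from filesystem metadata instead of seekg/tellg.
std::string read_whole_file(const std::filesystem::path& p) {
    std::string buffer(static_cast<std::size_t>(std::filesystem::file_size(p)), '\0');
    std::ifstream f(p, std::ios::binary);
    f.read(buffer.data(), buffer.size()); // non-const data() is available in C++17
    return buffer;
}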

Kari
  • 956
  • 7
  • 21
1

I think you could read the file in parallel by creating n threads, each with its own offset, using David's method, and then pull the data into separate areas which you then map to a single location. Check out ROMIO for ideas on how to maximize speed. ROMIO's ideas could be done in standard C++ without much trouble.
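
A rough sketch of this idea, assuming POSIX pread() and a one-slice-per-thread split (neither is spelled out in the answer):

#include <fcntl.h>
#include <sys/stat.h>
#include <unistd.h>
#include <algorithm>
#include <cstddef>
#include <string>
#include <thread>
#include <vector>

int main() {
    const int fd = ::open("file.txt", O_RDONLY);
    if (fd < 0) return 1;
    struct stat st;
    if (::fstat(fd, &st) != 0) { ::close(fd); return 1; }

    const std::size_t size = static_cast<std::size_t>(st.st_size);
    std::string buffer(size, '\0');
    const unsigned n = std::max(1u, std::thread::hardware_concurrency());
    const std::size_t slice = (size + n - 1) / n;   // bytes per thread, rounded up

    std::vector<std::thread> workers;
    for (unsigned i = 0; i < n; ++i) {
        const std::size_t off = static_cast<std::size_t>(i) * slice;
        if (off >= size) break;
        const std::size_t len = std::min(slice, size - off);
        workers.emplace_back([&buffer, fd, off, len] {
            // pread() takes an explicit offset, so threads don't fight over the fd position.
            std::size_t done = 0;
            while (done < len) {
                const ssize_t r = ::pread(fd, &buffer[off + done], len - done, off + done);
                if (r <= 0) break;                  // error or unexpected EOF
                done += static_cast<std::size_t>(r);
            }
        });
    }
    for (auto& t : workers) t.join();
    ::close(fd);

    // `buffer` now holds the whole file. The slices are byte ranges, so lines can
    // straddle two slices; process the assembled buffer as a whole.
}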

cmg
  • 29
  • 3
  • 6