
I've stumbled upon this code to insert the contents of a file into a vector. It seems like a useful thing to learn how to do:

#include <iostream>
#include <fstream>
#include <iterator>
#include <vector>

int main() {
    typedef std::vector<char> fileContainer;
    std::ifstream testFile("testfile.txt");
    fileContainer container;
    container.assign(
        (std::istreambuf_iterator<char>(testFile)),
        std::istreambuf_iterator<char>());
    return 0;
}

It works, but I'd like to ask: is this the best way to do such a thing? That is, to take the contents of any file type and insert it into an appropriate STL container. Is there a more efficient way of doing this than the above? As I understand it, it creates a testFile instance of ifstream and fills it with the contents of testfile.txt, then that copy is copied again into the container through assign. Seems like a lot of copying?

As for speed/efficiency, I'm not sure how to estimate the file size and use the reserve function with it; if I use reserve it even appears to slow this code down. At the moment, swapping out vector and just using a deque seems quite a bit more efficient.

Aerospace
Daniel Gratz

6 Answers


it creates a testFile instance of ifstream and fills it with the contents of testfile.txt

No, it opens testfile.txt and calls the handle testFile. There is one copy being made, from disk to memory. (Except that I/O is commonly done by another copy through kernel space, but you're not going to avoid that in a portable way.)

As for speed/efficiency, I'm not sure how to estimate the file size and use the reserve function with that

If the file is a regular file:

std::ifstream testFile("testfile.txt");
testFile.seekg(0, std::ios::end);
std::streampos size = testFile.tellg();
testFile.seekg(0, std::ios::beg);

std::vector<char> container;
container.reserve(size);

Then fill container as before. Or construct it as std::vector<char> container(size) and fill it with

testFile.read(&container.front(), size);

Which one is faster should be determined by profiling.

Fred Foo
  • Doesn't that add a zero initialization though? Wouldn't it be more efficient to get the file size, call `container.reserve(size)`, then do the `container.assign((std::istreambuf_iterator...` that he had before? – Mooing Duck Oct 17 '11 at 17:25
  • @MooingDuck: yes, the zero init can be avoided with `reserve`. Changed the answer. – Fred Foo Oct 17 '11 at 17:33
  • You still have to use `container.assign` or similar, because at the end of the code you posted, `container` still thinks it's empty, and if you ever copy it or access elements, it will fail miserably. – Mooing Duck Oct 17 '11 at 17:40
  • The edited version is incorrect; reserve doesn't put any data into the vector, and you can't write to where there is no data. To read the data using `reserve`, you need to call `push_back`, which means an additional test. Depending on the machine, one solution or the other will be most rapid; the one time I measured, on an Intel, with g++, creating the vector with the correct size was more rapid than using `reserve` and `push_back`. – James Kanze Oct 17 '11 at 17:55
  • @JamesKanze: I was editing as you posted your comment. Please check the new answer. – Fred Foo Oct 17 '11 at 17:56
  • @Iarsmans: The code is still wrong. It should either `.resize()` and `read()` -or- `.reserve()` and `push_back()` You can't mix-n-match those. In your code the vector still thinks it holds 0 elements. (James' comment about the checks is correct, zero initialization->read is faster than reserve->push_back) – Mooing Duck Oct 17 '11 at 18:50
  • @Mooing Duck: Any chance of small code sample using reserve and push_back? I get 10ms with http://pastebin.com/DSB7BTVZ as opposed to 2000ms for the stand alone code in original question (when using a 1000kb file). So clearly it's an improvement, but this reserve push_back is confusing a bit. – Daniel Gratz Oct 17 '11 at 19:25
  • This http://pastebin.com/eKq9FFbT takes 1400ms, so i can't see a reason to use istreambuf_iterator at all over the above. – Daniel Gratz Oct 17 '11 at 19:35
  • http://codepad.org/XoU5D5Oz Premature optimization bites me again. for my 3498KB file, OP's took 3063 ticks, reserve/pushback took 3047 ticks, zeroinit/read took 47 ticks. Turns out reallocation/move is not the bottleneck. – Mooing Duck Oct 17 '11 at 19:45
  • Nice code! Thanks. I ran it on same size file and got and 109, 100 and 23 ticks. zeroinit_read() is the way to go. This topic is really interesting to me and i got given the bounty ability from it, maybe will have to post a bounty some time to come up with something even faster than zeroinit_read() ;) – Daniel Gratz Oct 17 '11 at 23:13

I'm not sure that there's a single best way, but using the two-iterator constructor would be more idiomatic:

FileContainer container( (std::istreambuf_iterator<char>( testFile )),
                         (std::istreambuf_iterator<char>()) );

(I notice that you have the extra parentheses in your assign. They aren't necessary there, but they are when you use the constructor.)

With regards to performance, it would be more efficient to pre-allocate the data, something like:

FileContainer container( actualSizeOfFile );
std::copy( std::istreambuf_iterator<char>( testFile ),
           std::istreambuf_iterator<char>(),
           container.begin() );

This is slightly dangerous; if your estimation is too small, you'll encounter undefined behavior. To avoid this, you could also do:

FileContainer container;
container.reserve( estimatedSizeOfFile );
container.insert( container.begin(),
                  std::istreambuf_iterator<char>( testFile ),
                  std::istreambuf_iterator<char>() );

Which of these two is faster will depend on the implementation; the last time I measured (with g++), the first was slightly faster, but if you're actually reading from file, the difference probably isn't measurable.

The problem with these two methods is that, despite other answers, there is no portable way of finding the file size other than by actually reading the file. Non-portable methods exist for some systems (fstat under Unix), but on other systems, like Windows, there is no means of finding the exact number of char you can read from a text file. And of course, there's no guarantee that the results of tellg() will even convert to an integral type, and that if it does, that they won't be a magic cookie, with no numerical signification.

Having said that, in practice, the use of tellg() suggested by other posters will often be "portable enough" (Windows and most Unix, at least), and the results will often be "close enough"; they'll usually be a little too high under Windows (since the results will count the carriage return characters which won't be read), but in a lot of cases, that's not a big problem. In the end, it's up to you to decide what your requirements are with regards to portability and precision of the size.

James Kanze
  • Several great answers, but i appreciate the detail in this answer and it makes me understand overall more clearly some of the possible solutions. – Daniel Gratz Oct 17 '11 at 18:53
  • Since `std::istreambuf_iterator` appears to be an unformatted operation, it will return the correct size in Windows, not overestimate. It will insert the carriage returns as well. Also, `streampos` is convertible to `streamoff`, which converts to `streamsize`, which must be "one of the signed basic integral types". so `streamoff(tellg())` is guaranteed to convert to an integral type. – Mooing Duck Oct 17 '11 at 18:59
  • @MooingDuck `std::istreambuf_iterator` does not return a size, and under Windows, it **will** read less `char`s than you get from determining the size of the file (since it won't read the carriage returns). And while not the case under Windows (or any of the Unices I know), `streamoff` can be a user defined type, in which case, the conversion `streampos` to `streamsize` would involve two user defined conversions (and thus not be implicit). And of course, as I said, the results of the conversion aren't necessarily numerically useful. – James Kanze Oct 18 '11 at 09:15
  • I worded my comment poorly, and was partially wrong. Although std::istreambuf_iterator is an unformatted operation, I was wrong; it does indeed remove \r from newlines, causing windows to overestimate the buffersize. I didn't think it would. However, `streamoff(tellg())` is guaranteed to _convert_ to an integral type. – Mooing Duck Oct 18 '11 at 16:03
  • @MooingDuck `streamoff(tellg())` is guaranteed to be convertible to an integral type. There's still no guarantee with regards to the meaning of the numerical value of the results. Formally, of course. In practice, I think most implementations will do something "reasonable" (but I can imagine strange results under legacy OS's, like those on some of the mainframes). – James Kanze Oct 18 '11 at 16:25

The std::ifstream is not filled with the contents of the file; the contents are read on demand. Some kind of buffering is involved, so the file would be read in chunks of a few kilobytes. Since stream iterators are InputIterators, it should be more efficient to call reserve on the vector first; but only if you already have that information or can make a good guess, otherwise you would have to iterate through the file contents twice.

K-ballo

People much more frequently want to read from a file into a string than a vector. If you can use that, you might want to see the answer I posted to a previous question.

A minor edit of the fourth test there will give this:

std::vector<char> s4;

file.seekg(0, std::ios::end);
s4.resize(file.tellg());
file.seekg(0, std::ios::beg);

file.read(&s4[0], s4.size());

My guess is that this should give performance essentially indistinguishable from the code using a string. Depending on your compiler/standard library, this is likely to be substantially faster than your current code (again, see the timing results there for some idea of the difference you're likely to see).

Also note that this gives a little extra ability to detect and diagnose errors. For example, you can check whether you successfully read the entire file by comparing s4.size() to file.gcount() (and/or check for file.eof()). This also makes it a bit easier to prevent problems by limiting the amount you read, in case somebody decides to see what happens when/if they try to use your program to read a file that's, say, 6 terabytes.

Jerry Coffin
  • 437,173
  • 71
  • 570
  • 1,035

There is definitely a better way if you want to make it efficient: check the file size, pre-allocate the vector, and read directly into the vector's memory. A simple example:

#include <sys/types.h>
#include <sys/stat.h>
#include <unistd.h>
#include <fcntl.h>
#include <cstdio>
#include <cstdlib>
#include <vector>
#include <iostream>

using namespace std;

int main ()
{
    int fd = open ("test.data", O_RDONLY);

    if (fd == -1)
    {
        perror ("open");
        return EXIT_FAILURE;
    }

    struct stat info;
    int res = fstat (fd, &info);

    if (res != 0)
    {
        perror ("fstat");
        return EXIT_FAILURE;
    }

    std::vector<char> data;

    if (info.st_size > 0)
    {
        data.resize (info.st_size);
        ssize_t x = read (fd, &data[0], data.size ());

        if (x != info.st_size)
        {
            perror ("read");
            return EXIT_FAILURE;
        }

        cout << "Data (" << info.st_size << "):\n";
        cout.write (&data[0], data.size ());
    }
}

There are other more efficient ways for some tasks. For example, to copy file without transferring data to and from user space, you can use sendfile etc.


It does work, and it is convenient, but there are many situations where it is a bad idea.

Error handling in a user-edited file, for example. If the user has hand edited a data file or it has been imported from a spreadsheet or even a database with lax field definitions, then this method of filling the vector will result in a simple error with no detail.

In order to process the file and report where the error happened, you need to read it line by line and attempt the conversion to a number on each line. Then you can report the line number and the text that failed to convert. This is extremely useful. Without this feature the user is left to wonder which line caused the problem instead of being able to immediately fix it.

Zan Lynx
  • In both cases, there will be no error until you interpret the data. Having the data in a stream vs a vector should not affect that in the slightest. – Mooing Duck Oct 17 '11 at 17:27
  • @MooingDuck: What, are you just reading into a vector of strings? What is the point of doing that? Real programs read into data structures of some kind. – Zan Lynx Oct 17 '11 at 21:11
  • @MooingDuck: Okay, I see that in the original question he *is* reading into a vector of char. I'm going to believe that's for sake of the example and not a real program. – Zan Lynx Oct 17 '11 at 21:12
  • Since he's using an unformatted read into a vector of char and complaining that it's slow, I'd assume that that _is_ his real program. (Though you are right that it's not a good idea) – Mooing Duck Oct 17 '11 at 21:21