
I would like to know the performance overhead of

string line, word;
while (std::getline(cin, line))
{
    istringstream istream(line);
    while (istream >> word)
    {
        // parse word here
    }
}

I think this is the standard C++ way to tokenize input.

To be specific:

  • Is each line copied three times: first by getline, then by the istream constructor, and finally by operator>> for each word?
  • Would the frequent construction and destruction of istream be an issue? What is the equivalent implementation if I define istream before the outer while loop?

Thanks!

Update:

An equivalent implementation

string line, word;
stringstream stream;
while (std::getline(cin, line))
{
    stream.clear();
    stream << line;
    while (stream >> word)
    {
        // parse word here
    }
}

uses a stream as a local stack that lines are pushed onto and words are popped from. This would avoid the frequent constructor and destructor calls of the previous version and take advantage of the stream's internal buffering (is this point correct?).

Alternative solutions might be to extend std::string to support operator<< and operator>>, or to extend iostream to support something like locate_new_line. Just brainstorming here.

csyangchen
    Honestly, I expect the fact you're doing I/O -- especially with user input -- to be several orders of magnitude more significant than the fact you might be allocating a small array a few more times than absolutely necessary. –  Jun 09 '12 at 14:42

2 Answers


Unfortunately, iostreams is not for performance-intensive work. The problem is not copying things in memory (copying strings is fast), it's virtual function dispatches, potentially to the tune of several indirect function calls per character.
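If profiling does point at stream extraction, one stream-free option is to scan each line in place. The following is only a minimal sketch: `split_words` is a hypothetical helper, not part of any library, and it assumes the six whitespace characters of the default "C" locale rather than consulting the stream's locale the way `operator>>` does.

```cpp
#include <cassert>
#include <string>
#include <vector>

// Hypothetical helper: splits a line on whitespace without any stream.
// Assumption: "whitespace" means the six characters below ("C" locale).
std::vector<std::string> split_words(const std::string& line) {
    static const char* const kSpace = " \t\n\v\f\r";
    const std::string::size_type npos = std::string::npos;

    std::vector<std::string> words;
    // Skip leading whitespace to find the start of the first word.
    std::string::size_type pos = line.find_first_not_of(kSpace);
    while (pos != npos) {
        // The word ends at the next whitespace character (or at end of line).
        std::string::size_type end = line.find_first_of(kSpace, pos);
        assert(end == npos || end > pos);  // line[pos] is never whitespace
        words.push_back(line.substr(pos, end == npos ? npos : end - pos));
        // Move on to the start of the next word, if there is one.
        pos = (end == npos) ? npos : line.find_first_not_of(kSpace, end);
    }
    return words;
}
```

Each word is still copied once into its own `std::string`, but the line itself is never copied into a stream buffer.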

As for your question about copying, yes, as written everything gets copied when you initialize a new stringstream. (Characters also get copied from the stream to the output string by getline or >>, but that obviously can't be prevented.)

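With a C++20 standard library, istringstream has a constructor that takes the string by rvalue reference, so you can move the line in and eliminate the extraneous copy. Before C++20 the constructor only takes const std::string&, so the code below still compiles but copies the string anyway:

string line, word;
while (std::getline(cin, line)) // initialize line
{       // move data from line into istream (line is left valid but unspecified):
    istringstream istream( std::move( line ) );
    while (istream >> word)
    {
        // parse word here
    }
}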

All that said, performance is only an issue if a measurement tool tells you it is. Iostreams is flexible and robust, and filebuf is basically fast enough, so you can prototype the code so it works and then optimize the bottlenecks without rewriting everything.
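One way to take that measurement is a small timing harness. This is a rough sketch only: `count_words_fresh`, `count_words_reused`, `time_us` and `compare_strategies` are hypothetical names, the input is assumed to be already in memory, and a single wall-clock run like this is noisy, so a real profiler or repeated runs would be more trustworthy.

```cpp
#include <cassert>
#include <chrono>
#include <cstddef>
#include <sstream>
#include <string>
#include <vector>

// Strategy A: construct a fresh istringstream for every line.
std::size_t count_words_fresh(const std::vector<std::string>& lines) {
    std::size_t n = 0;
    std::string word;
    for (const std::string& line : lines) {
        std::istringstream ss(line);
        while (ss >> word) ++n;
    }
    return n;
}

// Strategy B: reuse one stream, resetting its flags and contents per line.
std::size_t count_words_reused(const std::vector<std::string>& lines) {
    std::size_t n = 0;
    std::string word;
    std::istringstream ss;
    for (const std::string& line : lines) {
        ss.clear();
        ss.str(line);
        while (ss >> word) ++n;
    }
    return n;
}

// Hypothetical timing helper: wall-clock microseconds for one call of f.
template <typename F>
long long time_us(F f, const std::vector<std::string>& lines, std::size_t& result) {
    auto start = std::chrono::steady_clock::now();
    result = f(lines);
    auto stop = std::chrono::steady_clock::now();
    return std::chrono::duration_cast<std::chrono::microseconds>(stop - start).count();
}

// Runs both strategies on the same input; returns (fresh - reused) in microseconds.
long long compare_strategies(const std::vector<std::string>& lines) {
    std::size_t fresh = 0, reused = 0;
    long long t_fresh = time_us(count_words_fresh, lines, fresh);
    long long t_reused = time_us(count_words_reused, lines, reused);
    assert(fresh == reused);  // sanity check: both must see the same words
    return t_fresh - t_reused;
}
```

Calling `compare_strategies` on a large vector of representative lines gives a first impression of whether the per-line construction matters on a given platform.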

Potatoswatter
  • You are really doing injustice to the `basic_streambuf` design; what you describe should only happen if the stream buffer that `istringstream` uses is completely unbuffered. –  Jun 09 '12 at 14:38
  • @Hurkyl Or if there is an encoding conversion (such as UTF-8), or if the extracted fields are small (each extraction is at least one virtual call). `filebuf` is relatively fast, as I mentioned in the last paragraph, but in practice stream extraction turns into a bottleneck sooner than you might expect. – Potatoswatter Jun 09 '12 at 16:20
  • I can't talk about locales, but I don't see the virtual call in small extracted fields. Skimming the docs, `basic_istream` doesn't have any virtual members, and the actual extraction of characters from the streambuf would be done with non-virtual functions like `sbumpc`, which are just pointer arithmetic and range checking; the virtual functions only get invoked under conditions like the buffer being empty. –  Jun 09 '12 at 16:28
  • @Hurkyl The calls particular to extraction are in the locales. Locale facets define the formatting of numbers, etc. Much of the problem is also that the functions called through that indirect branch tend to be slow. The locales section of the standard defines a lot of functionality, much of it never used, and I'm not aware of a library that actually attempts to optimize the "common case." Try profiling the parsing of a large CSV file and see what happens on your platform. – Potatoswatter Jun 09 '12 at 16:32
  • The `std::istringstream` constructor takes strings only as `const std::string&`, therefore `std::move` will not avoid a copy. The [reference](http://www.cplusplus.com/reference/sstream/istringstream/istringstream/) also explicitly says: `str: A string object, whose content is copied.` – nspo May 01 '21 at 10:49

When you define a variable inside a block, it is allocated on the stack, and it is popped from the stack when you leave the block. With this code you perform a lot of stack operations. This goes for 'word' too. You can use pointers and operate on them instead of on the variables. Pointers are stored on the stack too, but what they point to lives in heap memory.

Such operations carry overhead for creating the variable, pushing it onto the stack, and popping it off again. With pointers you allocate the space once and keep working with that allocated space on the heap. Pointers can also be much smaller than the objects themselves, so allocating them is faster.

As you can see, getline() accepts a reference (a kind of pointer) to the line object, which lets it work on that object without creating a new string each time.

In your code, the line and word variables are created once and used through references. The only object you create in each iteration is the stream. If you do not want to create it in each iteration, you can define it before the loop and set its contents with one of its member functions instead of the constructor.

You can use this:

string line, word;
istringstream ss;
while (std::getline(cin, line))
{
    ss.clear();
    ss.str(line);
    while (ss >> word) {
        // parse word here
    }
}

You can also consult the reference for istringstream.

EDIT: Thanks for the comment, @jrok. Yes, you should clear the error flags before assigning a new string. This is the reference for str(): istringstream::str
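To make the role of clear() and str() concrete, here is a small sketch under stated assumptions: `count_words` and `buffer_size_after_append` are hypothetical helpers written for illustration. The first shows that without clear() the stream stays in a failed state after the first line; the second shows that feeding lines with `stream << line`, as in the question's update, appends to the internal string instead of replacing it.

```cpp
#include <cassert>
#include <cstddef>
#include <sstream>
#include <string>
#include <vector>

// Counts the words extracted when one istringstream is reused via str(line).
// If reset_flags is false, eofbit/failbit from the previous line are kept,
// so every extraction after the first line fails immediately.
std::size_t count_words(const std::vector<std::string>& lines, bool reset_flags) {
    std::istringstream ss;
    std::string word;
    std::size_t n = 0;
    for (const std::string& line : lines) {
        if (reset_flags) ss.clear();
        ss.str(line);  // replaces the stream's contents
        while (ss >> word) ++n;
    }
    return n;
}

// Feeds each line with clear() + operator<< and returns the size of the
// stream's internal string afterwards. Because operator<< appends, the
// result is the total length of all lines, not the length of the last one.
std::size_t buffer_size_after_append(const std::vector<std::string>& lines) {
    std::stringstream stream;
    std::string word;
    for (const std::string& line : lines) {
        stream.clear();
        stream << line;  // appends to what is already stored
        while (stream >> word) {}
        assert(stream.eof());  // the inner loop stops because the stream ran dry
    }
    return stream.str().size();
}
```

For long inputs the appending variant therefore keeps every line in memory, which is one reason to prefer str(line) when reusing the stream.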

Kamran Amini