
I'm reading in a text file of random ASCII from an ifstream. I need to be able to put the whole message into a string type for character parsing. My current solution works, but I think I'm murdering process time on the lengthier files by using the equivalent of this:

std::string result;

for (std::string line; std::getline(std::cin, line); )
{
    result += line;
}

I'm concerned about the overhead associated with concatenating strings like this (this happens a few thousand times, with a message tens of thousands of characters long). I've spent the last few days browsing different potential solutions, but nothing quite fits... I don't know the length of the message ahead of time, so I don't think using a dynamically sized character array is my answer.

I read through this SO thread, which sounded almost applicable but still left me unsure.

Any suggestions?

Jeremy

4 Answers


The problem really is that you don't know the full size ahead of time, so you cannot allocate memory appropriately. I would expect that the performance hit you are seeing is related to that, not to the way strings are concatenated, since concatenation is done efficiently by the standard library.

Thus, I would recommend deferring concatenation until you know the full size of your final string. That is, you start by storing all your strings in a big vector as in:

using namespace std;
vector<string> allLines;
size_t totalSize = 0;
// If you can have access to the total size of the data you want
// to read (size of the input file, ...) then just initialize totalSize
// and use only the second code snippet below.
for (string line; getline(cin, line); )
{
    allLines.push_back(line);
    totalSize += line.size();
}

Then, you can create your big string knowing its size in advance:

string finalString;
finalString.reserve(totalSize);
for (vector<string>::iterator itS = allLines.begin(); itS != allLines.end(); ++itS)
{
    finalString += *itS;
}

That said, you should only do this if you actually experience performance issues. Don't try to optimize things that don't need it; otherwise you will complicate your program with no noticeable benefit. The places where optimization is needed are often counterintuitive and can vary from environment to environment, so only do this if your profiling tool tells you to.

Qortex
    What is that kind of file you can't get the length in bytes? – dtech Apr 05 '13 at 22:08
  • As you can see in OP question, he's using `cin` here which is an example. Also, you can access network resources that you receive piece by piece, and so on, and so on, and so on. (and so on) – Qortex Apr 05 '13 at 22:10
  • @Mic: Did you forget to read the very first line? `I'm reading in a text file of random ascii from an ifstream.` However there are still valid reasons for the constraints. I think ddriver is being rather short-sighted, to be honest. – Lightness Races in Orbit Apr 05 '13 at 22:13
  • @ddriver: `/dev/random` is a file, but what is its size in bytes? – Joker_vD Apr 05 '13 at 22:17
  • @LightnessRacesinOrbit Ok yeah, forgot that, paid attention to the code. Anyway, the OP question suggests he doesn't want to base his solution on file length. Edited my answer in that sense. – Qortex Apr 05 '13 at 22:19
  • Not a linux guy, but yes, it is possible to have a source of unknown size. – dtech Apr 05 '13 at 22:20
  • Lightness is correct here, my stopping point _is_ truly unknown. I stop based on information I'm gathering from the stream during the read-in (I'm sort of decrypting small groups of these characters in 'real time', to make a long story short!). So simply making the storage space equal to the size of the file is wasteful. I've read some other things here that might be useful; thanks for all the input. – Jeremy Apr 05 '13 at 22:37
  • So why didn't you mention that in your question? And why not simply find the index of the "terminating sequence" in the file and read everything from the start to that index at once? Or maybe it is not a file at all; you just say it is a file, while the code you posted reads console input? – dtech Apr 05 '13 at 23:11
  • Stop ranting. Everybody seems to have understood that this is a kind of theoretical question, and the OP will figure out from the answers the way to go with his specific issue. Having a more general problem is more helpful than a precise one in this case, since it might later be of interest to other people. – Qortex Apr 05 '13 at 23:14

If you know the file size, use result's member function 'reserve()' once.

dan
  • Knowing the size of the file isn't the same as knowing the length of the content that he's interested in reading in. –  Apr 05 '13 at 22:12
  • @Sancho: Correct. Yet, the latter is what is useful. Reserving the _entire_ filesize is fairly dumb if the file is 600MB and your message length turns out to be 1MB. – Lightness Races in Orbit Apr 05 '13 at 22:16
  • @LightnessRacesinOrbit - "fairly dumb", "rather short-sighted"... condescend much? The OP doesn't seem to know what he wants to do and you come up with amazing scenarios to portray some kind of superiority... Not cool ;) – dtech Apr 05 '13 at 22:31
  • @ddriver: \*shrug\* I'm just saying what I see. Reserving 600x your necessary memory _is_ "fairly dumb", and you _are_ being "rather short-sighted" in your assumptions as to what the OP can and cannot do. I managed to parse the requirements just fine, and [my interpretation has been confirmed](http://stackoverflow.com/questions/15844074/c-overhead-from-string-concatenation/15844149?noredirect=1#comment22546731_15844416). – Lightness Races in Orbit Apr 05 '13 at 22:51
  • @LightnessRacesinOrbit - yeah, yeah, we get it, you are great ;) too bad he didn't confirm the 15 GB input file as well, that would have been a real blast – dtech Apr 05 '13 at 23:00
  • Wow how much debate can come out of an ill-specified problem! @LightnessRacesinOrbit- I agree that reserving memory is "fairly dumb" provided that there actually exists a fairly smart solution – dan Apr 05 '13 at 23:12

I'm too sleepy to put together any solid data for you but, ultimately, without knowing the size ahead of time you're always going to have to do something like this. And the truth is that your standard library implementation is smart enough to handle string resizing fairly smartly. (That's despite the fact that there's no exponential growth guarantee for std::string, the way that there is for std::vector.)

So although you may see unwanted re-allocations the first fifty or so iterations, after a while, the re-allocated block becomes so large that re-allocations become rare.

If you profile and find that this is still a bottleneck, perhaps use std::string::reserve yourself with a typical quantity.

Lightness Races in Orbit
  • Rather than risk a potential reallocation on every concatenation, how about collecting the substrings in a linked list and concatenating them all at the end? Albeit with that solution you're calling `new` on every substring, so it may not work out to be more efficient overall than the less frequent but more costly *reallocations* of std::string. – JBentley Apr 05 '13 at 21:51
  • The locality of reference means lists are nearly never useful; a rope may make sense, but I seriously doubt the performance issues are due to the memory allocation... – Alex Chamberlain Apr 05 '13 at 21:55
  • @AlexChamberlain: Well the subsequent copy won't be helping – Lightness Races in Orbit Apr 05 '13 at 22:05
  • Regarding complexity - cppreference [says](http://en.cppreference.com/w/cpp/string/basic_string/basic_string) that string constructor taking range has linear complexity, but I don't see confirmation in standard (though for vector it presents). Interesting, is it bug of cppreference? – Evgeny Panasyuk Apr 05 '13 at 22:18
  • @EvgenyPanasyuk: Table 99, row 5. I do find container requirements most sub-optimally laid-out. – Lightness Races in Orbit Apr 05 '13 at 22:27
  • @LightnessRacesinOrbit, well, I saw this line :) It is not what it looks like - **X(t, m)**. **m** is the allocator, not an iterator. – Evgeny Panasyuk Apr 05 '13 at 22:34
  • @EvgenyPanasyuk: Oops, yeah. Okay then I have no idea what the standard is doing for this case. `Effects: If InputIterator is an integral type, equivalent to basic_string(static_cast(begin), static_cast(end), a)` LOLWUT – Lightness Races in Orbit Apr 05 '13 at 22:42
  • @LightnessRacesinOrbit, Yeah, I was also confused to see this :) After staring at it for several minutes, I came up with the idea that it is special handling of the case [string('\x5','0')](http://liveworkspace.org/code/2Gw36W), i.e. string(n,ch) – Evgeny Panasyuk Apr 05 '13 at 23:02
  • @EvgenyPanasyuk: But that breaks the semantics of the overload >. – Lightness Races in Orbit Apr 05 '13 at 23:24

You're copying the result array for every line in the file (as `result` is expanded). Instead, pre-allocate the result and grow it exponentially:

std::string result;
result.reserve(1024); // pre-allocate a typical size

for (std::string line; std::getline(std::cin, line); )
{
    // every time we run out of space, double the available space
    while(result.capacity() < result.length() + line.length())
        result.reserve(result.capacity() * 2);

    result += line;
}
Aaron