
I'm reading in a text file of random ASCII from an ifstream. I need to be able to put the whole message into a string type for character parsing. My current solution works, but I think I'm murdering process time on the lengthier files by using the equivalent of this:

std::string result;

for (std::string line; std::getline(std::cin, line); )
{
    result += line;
}

I'm concerned about the overhead associated with concatenating strings like this (this happens a few thousand times, with a message tens of thousands of characters long). I've spent the last few days browsing different potential solutions, but nothing quite fits... I don't know the length of the message ahead of time, so I don't think using a dynamically sized character array is my answer.

I read through this SO thread, which sounded almost applicable but still left me unsure.

Any suggestions?

Jeremy

4 Answers


The problem really is that you don't know the full size ahead of time, so you cannot allocate memory appropriately. I would expect that the performance hit you are seeing is related to that, not to the way strings are concatenated, since concatenation is done efficiently by the standard library.

Thus, I would recommend deferring concatenation until you know the full size of your final string. That is, you start by storing all your strings in a big vector as in:

using namespace std;
vector<string> allLines;
size_t totalSize = 0;
// If you can have access to the total size of the data you want
// to read (size of the input file, ...) then just initialize totalSize
// and use only the second code snippet below.
for (string line; getline(cin, line); )
{
    allLines.push_back(line);
    totalSize += line.size();
}

Then, you can create your big string knowing its size in advance:

string finalString;
finalString.reserve(totalSize);
for (vector<string>::iterator itS = allLines.begin(); itS != allLines.end(); ++itS)
{
    finalString += *itS;
}

That said, you should only do this if you actually experience performance issues. Don't try to optimize things that don't need it; otherwise you will complicate your program with no noticeable benefit. The places where optimization is needed are often counterintuitive and can vary from environment to environment, so only do this if your profiling tool tells you to.

Qortex
    What is that kind of file you can't get the length in bytes? – dtech Apr 05 '13 at 22:08
  • As you can see in OP question, he's using `cin` here which is an example. Also, you can access network resources that you receive piece by piece, and so on, and so on, and so on. (and so on) – Qortex Apr 05 '13 at 22:10
  • @Mic: Did you forget to read the very first line? `I'm reading in a text file of random ascii from an ifstream.` However there are still valid reasons for the constraints. I think ddriver is being rather short-sighted, to be honest. – Lightness Races in Orbit Apr 05 '13 at 22:13
  • @ddriver: `/dev/random` is a file, but what is its size in bytes? – Joker_vD Apr 05 '13 at 22:17
  • @LightnessRacesinOrbit Ok yeah, forgot that, paid attention to the code. Anyway, the OP question suggests he doesn't want to base his solution on file length. Edited my answer in that sense. – Qortex Apr 05 '13 at 22:19
  • Not a linux guy, but yes, it is possible to have a source of unknown size. – dtech Apr 05 '13 at 22:20
  • Lightness is correct here, my stopping point _is_ truly unknown. I stop based on information I'm gathering from the stream during the read-in (I'm sort of decrypting small groups of these characters in 'real time', to make a long story short!). So simply making the storage space equal to the size of the file is wasteful. I've read some other things here that might be useful; thanks for all the input. – Jeremy Apr 05 '13 at 22:37
  • So why didn't you mention that in your question? And why not simply find the index of the "terminating sequence" in the file and read everything from the start to that index at once? Or maybe it is not a file at all; you just say it is a file, while the code you posted reads console input? – dtech Apr 05 '13 at 23:11
  • Stop ranting. Everybody seems to have understood that this is a kind of theoretical question, and the OP will figure out from the answers the way to go with his specific issue. Having a more general problem is more helpful than a precise one in this case, since it might later be of interest to other people. – Qortex Apr 05 '13 at 23:14

If you know the file size, use result's member function 'reserve()' once.

dan
  • Knowing the size of the file isn't the same as knowing the length of the content that he's interested in reading in. –  Apr 05 '13 at 22:12
  • @Sancho: Correct. Yet, the latter is what is useful. Reserving the _entire_ filesize is fairly dumb if the file is 600MB and your message length turns out to be 1MB. – Lightness Races in Orbit Apr 05 '13 at 22:16
  • @LightnessRacesinOrbit - "fairly dumb", "rather short-sighted"... condescend much? The OP doesn't seem to know what he wants to do and you come up with amazing scenarios to portray some kind of superiority... Not cool ;) – dtech Apr 05 '13 at 22:31
  • @ddriver: \*shrug\* I'm just saying what I see. Reserving 600x your necessary memory _is_ "fairly dumb", and you _are_ being "rather short-sighted" in your assumptions as to what the OP can and cannot do. I managed to parse the requirements just fine, and [my interpretation has been confirmed](http://stackoverflow.com/questions/15844074/c-overhead-from-string-concatenation/15844149?noredirect=1#comment22546731_15844416). – Lightness Races in Orbit Apr 05 '13 at 22:51
  • @LightnessRacesinOrbit - yeah, yeah, we get it, you are great ;) too bad he didn't confirm the 15 GB input file as well, that would have been a real blast – dtech Apr 05 '13 at 23:00
  • Wow how much debate can come out of an ill-specified problem! @LightnessRacesinOrbit- I agree that reserving memory is "fairly dumb" provided that there actually exists a fairly smart solution – dan Apr 05 '13 at 23:12

I'm too sleepy to put together any solid data for you but, ultimately, without knowing the size ahead of time you're always going to have to do something like this. And the truth is that your standard library implementation is smart enough to handle string resizing fairly smartly. (That's despite the fact that there's no exponential growth guarantee for std::string, the way that there is for std::vector.)

So although you may see unwanted re-allocations the first fifty or so iterations, after a while, the re-allocated block becomes so large that re-allocations become rare.

If you profile and find that this is still a bottleneck, perhaps use std::string::reserve yourself with a typical quantity.

Lightness Races in Orbit
  • Rather than risk a potential reallocation on every concatenation, how about collecting the substrings in a linked list and concatenating them all at the end? Albeit with that solution you're calling `new` on every substring, so it may not work out to be more efficient overall than the less frequent but more costly *reallocations* of std::string. – JBentley Apr 05 '13 at 21:51
  • The locality of reference means lists are nearly never useful; a rope may make sense, but I seriously doubt the performance issues are due to the memory allocation... – Alex Chamberlain Apr 05 '13 at 21:55
  • @AlexChamberlain: Well the subsequent copy won't be helping – Lightness Races in Orbit Apr 05 '13 at 22:05
  • Regarding complexity - cppreference [says](http://en.cppreference.com/w/cpp/string/basic_string/basic_string) that string constructor taking range has linear complexity, but I don't see confirmation in standard (though for vector it presents). Interesting, is it bug of cppreference? – Evgeny Panasyuk Apr 05 '13 at 22:18
  • @EvgenyPanasyuk: Table 99, row 5. I do find container requirements most sub-optimally laid-out. – Lightness Races in Orbit Apr 05 '13 at 22:27
  • @LightnessRacesinOrbit, well, I saw this line :) It is not what it looks like - **X(t, m)**. **m** is the allocator, not an iterator. – Evgeny Panasyuk Apr 05 '13 at 22:34
  • @EvgenyPanasyuk: Oops, yeah. Okay then I have no idea what the standard is doing for this case. `Effects: If InputIterator is an integral type, equivalent to basic_string(static_cast(begin), static_cast(end), a)` LOLWUT – Lightness Races in Orbit Apr 05 '13 at 22:42
  • @LightnessRacesinOrbit, Yeah, I was also confused to see this :) After staring at it for several minutes, I came up with the idea that it is special handling of the case [string('\x5','0')](http://liveworkspace.org/code/2Gw36W), i.e. string(n,ch) – Evgeny Panasyuk Apr 05 '13 at 23:02
  • @EvgenyPanasyuk: But that breaks the semantics of the overload >. – Lightness Races in Orbit Apr 05 '13 at 23:24

You're copying the result array for every line in the file (as `result` is expanded). Instead, pre-allocate the result and grow it exponentially:

std::string result;
result.reserve(1024); // pre-allocate a typical size

for (std::string line; std::getline(std::cin, line); )
{
    // every time we run out of space, double the available space
    while(result.capacity() < result.length() + line.length())
        result.reserve(result.capacity() * 2);

    result += line;
}
Aaron