
Is there a way, rather than using getline() for each row in a CSV, to instead read in a larger chunk, say 10,000 rows, into a string? The idea would then be to write code that splits the string into substrings and puts the elements into the desired arrays/vectors.
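
Something along these lines is what I have in mind (just a rough, untested sketch of the idea; the 1 MB chunk size, the file name, and pulling out only the second column are placeholder choices):

    #include <cstdlib>
    #include <fstream>
    #include <iostream>
    #include <string>
    #include <vector>

    int main()
    {
        std::ifstream myFile("data.csv", std::ios::binary);
        std::vector<double> y;

        const std::size_t chunkSize = 1 << 20;   // read ~1 MB at a time instead of line by line
        std::string buffer(chunkSize, '\0');
        std::string leftover;                    // partial line carried over from the previous chunk

        while (myFile.read(&buffer[0], chunkSize) || myFile.gcount() > 0)
        {
            std::string chunk = leftover + buffer.substr(0, myFile.gcount());
            std::size_t start = 0, newline;

            // split the chunk into complete lines
            while ((newline = chunk.find('\n', start)) != std::string::npos)
            {
                std::string line = chunk.substr(start, newline - start);
                std::size_t comma = line.find(',');
                if (comma != std::string::npos)
                    y.push_back(std::atof(line.c_str() + comma + 1)); // second column
                start = newline + 1;
            }
            leftover = chunk.substr(start);      // keep the incomplete trailing line
        }

        if (!leftover.empty())                   // last line if the file has no final newline
        {
            std::size_t comma = leftover.find(',');
            if (comma != std::string::npos)
                y.push_back(std::atof(leftover.c_str() + comma + 1));
        }

        std::cout << "read " << y.size() << " values\n";
    }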

Currently, loading CSVs (50-1500 MB) is taking 5+ minutes. From trawling related questions, it seems the bottleneck is calling getline() for every line / the per-line stream and system calls are what is causing the slowness?

I'm a C++ newbie, so if anyone knows a better solution that would be appreciated!

This is my current slow code if it helps:

    while (!myFile.eof())
    {
        string aLine; //holds the read-in line
        getline(myFile, aLine); //reads a line from the file into aLine

        std::string input = aLine;
        std::istringstream ss(input);
        std::string token;

        while (std::getline(ss, token, ',')) {
            t++;
            if (t == 2) {
                y.push_back(0);
                y[i] = atof(token.c_str());
                cout << y[i] << endl;
            }
        }
        t = 0;
        i++;
    }

EDIT: Thanks John Zwinck, the time has decreased from 232.444 seconds to 156.248 seconds. Also thanks Richard Critten; I will update with the time elapsed using memory maps with Boost.
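
For reference, the memory-mapping variant I'm looking at is roughly this (an untested sketch assuming boost::iostreams and the same layout as above, where the value I want follows the first comma on each line):

    #include <boost/iostreams/device/mapped_file.hpp>  // link with -lboost_iostreams
    #include <cstdlib>
    #include <cstring>
    #include <iostream>
    #include <string>
    #include <vector>

    int main()
    {
        // map the whole file and treat it as one big read-only char array
        boost::iostreams::mapped_file_source file("data.csv");
        const char* p = file.data();
        const char* const end = p + file.size();

        std::vector<double> y;
        while (p < end)
        {
            const char* eol = static_cast<const char*>(std::memchr(p, '\n', end - p));
            if (!eol)
                eol = end;                               // last line without a trailing newline

            const char* comma = static_cast<const char*>(std::memchr(p, ',', eol - p));
            if (comma)
            {
                std::string rest(comma + 1, eol);        // rest of the line as a NUL-terminated copy
                y.push_back(std::atof(rest.c_str()));    // atof parses the leading number: the second column
            }
            p = eol + 1;                                 // move to the next line
        }
        std::cout << "read " << y.size() << " values\n";
    }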

squshy
  • http://stackoverflow.com/questions/5605125/why-is-iostreameof-inside-a-loop-condition-considered-wrong – πάντα ῥεῖ Jul 10 '16 at 09:35
  • Platform specific: memory map the entire file and treat it as an array of char. – Richard Critten Jul 10 '16 at 09:42
  • Considering the number of times `t` could possibly be `2` in that inner token search (once), I'm surprised you keep right on marching through that line looking for more commas until exhausted. wtb a `break`. And you could certainly do without the needless copy of `std::string input = aLine;`, rather just using `aLine` in your stream constructor. – WhozCraig Jul 10 '16 at 09:45
  • thanks, looking at memory maps now. WhozCraig ~ aha didn't spot that, it was a temporary measure, was planning on using the other later columns. – squshy Jul 10 '16 at 09:49

1 Answer


The biggest performance problems in your code are, in order of severity:

  • Excessive memory allocations.
  • Unnecessary use of stringstream.
  • Failure to short-circuit after t == 2.
  • Unnecessary flush of cout (platform dependent).

Something like this should be a lot faster:

    y.reserve(1000);                                  // grow the vector up front to avoid repeated reallocations
    for (string aLine; getline(myFile, aLine); ) {    // also fixes the eof() loop condition
        string::size_type comma = aLine.find(',');
        if (comma == string::npos)                    // no second column on this line
            continue;

        y.push_back(atof(aLine.c_str() + comma + 1)); // parse the text right after the first comma
        cout << y.back() << '\n';                     // '\n' instead of endl: no flush per line
    }
John Zwinck
  • @squshy: You're welcome. Please try out the above and post a comment to let us know if it's faster now, and by how much. And if this answer solved your problem, you can "accept" it by clicking the checkmark on the left. Welcome to Stack Overflow. – John Zwinck Jul 10 '16 at 11:59
  • @Bob__: Look where OP's code has `if (t == 2)`. I think you might agree that code skips all the elements other than the second. – John Zwinck Jul 10 '16 at 12:31