
I have a pipe-delimited data file with more than 13 columns and a total size above 100 MB. I read each row and split the string into a `std::vector<std::string>` so I can do calculations on the fields. I repeat this for every row in the file, like below:

    string filename = "file.dat";
    fstream infile(filename);
    string line;
    while (getline(infile, line)) {
        string item;
        stringstream ss(line);
        vector<string> splittedString;
        while (getline(ss, item, '|')) {
            splittedString.push_back(item);
        }
        int a = stoi(splittedString[0]); 
        // I do some processing like this before some manipulation and calculations with the data
    }

This is, however, very time-consuming, and I am pretty sure it is not the most efficient way of reading a CSV-type file. How can this be improved?

Update

I tried using `boost::split` instead of the inner `while` loop, but it was actually even slower.

user7331538
  • You could always use a library that is dedicated to reading CSVs – KaiJ Jul 17 '19 at 09:12
  • @KaiJ do you have any suggestions? – user7331538 Jul 17 '19 at 09:14
  • Possible duplicate of [How can I read and parse CSV files in C++?](https://stackoverflow.com/questions/1120140/how-can-i-read-and-parse-csv-files-in-c). Additionally [here](https://softwarerecs.stackexchange.com/a/47525) is a list of C++/CSV libraries. – user1810087 Jul 17 '19 at 09:15
  • 1
    For start move the `std::vector` declaration out of the while loop and clear inside the loop instead. Or if the column count is always the same you could use an `std::array` (also declared outside the loop). Also like already suggested using a library for csv is the best option. Why reinvent the wheel? – GSIO01 Jul 17 '19 at 09:18
  • 1
    What does "very time consuming" mean here? What is the performance you need to achieve? Are the rows independent, so that processing could be done in parallel? – Karsten Koop Jul 17 '19 at 09:22
  • @KarstenKoop Parallelism is not an option. Since I am quite rusty on my C++, I was hoping to get some experienced tips on the go-to approach for fast CSV parsing. – user7331538 Jul 17 '19 at 09:26
  • 1
    I can't think of a better way, I personally would code it the same way as yours. – ShockCoding Jul 17 '19 at 09:31
  • @ShockCoding I am pretty sure there is a faster way. My approach above is even slower than a Java implementation with `BufferedReader`. – user7331538 Jul 17 '19 at 09:32
  • One thing: if you already know how many values are in each row, you could use an array instead of a vector, which is slightly faster. – ShockCoding Jul 17 '19 at 09:32
  • 2
    CSV is so trivial to read that reading it is limited only by speed of media even with fastest of SSDs. So your issue is likely in creating dynamic objects in tight loop, other processing like that and that it all is measured in debug build. – Öö Tiib Jul 17 '19 at 09:36

2 Answers


You don't actually have a CSV file: CSV stands for Comma-Separated Values, and your file is pipe-delimited, i.e. a plain delimited text file. That's good news, because parsing real CSV (with quoting and escaping) is more complicated than simply splitting on a delimiter.

Anyway, without too many dramatic changes to your approach, here are a few suggestions:

  • Use (more) buffering on the input stream.
  • Move the `vector` out of the loop and `clear()` it on every iteration. That saves heap reallocations, since the vector keeps its capacity.
  • Use `string::find()` instead of `stringstream` to split the string.

Something like this...

#include <fstream>
#include <string>
#include <vector>

using namespace std;

int main() {
    string filename = "file.dat";
    ifstream infile;
    // Give the stream a larger buffer to reduce the number of read() calls.
    // pubsetbuf() must be called before open() to be effective.
    char buffer[65536];
    infile.rdbuf()->pubsetbuf(buffer, sizeof(buffer));
    infile.open(filename);
    string line;
    vector<string> splittedString;   // declared once, reused every iteration
    while (getline(infile, line)) {
        splittedString.clear();      // keeps the capacity, avoids reallocations
        size_t last = 0, pos = 0;
        while ((pos = line.find('|', last)) != string::npos) {
            splittedString.emplace_back(line, last, pos - last);
            last = pos + 1;
        }
        if (!line.empty())
            splittedString.emplace_back(line, last);  // don't drop the trailing field
        int a = stoi(splittedString[0]);
        // I do some processing like this before some manipulation and calculations with the data
    }
}
rustyx

You can save another ~50% by eliminating the `vector<string> splittedString;` entirely and parsing in place with `strtok_s()`:

#include <chrono>
#include <fstream>
#include <iostream>
#include <string>

using namespace std;
using namespace std::chrono;

int main() {
    auto t1 = high_resolution_clock::now();
    long long a(0);

    string filename = "file.txt";
    ifstream infile;
    char buffer[65536];
    infile.rdbuf()->pubsetbuf(buffer, sizeof(buffer));  // before open()
    infile.open(filename);
    string line;
    while (getline(infile, line)) {
        // In-place tokenization: strtok_s overwrites each '|' with '\0'.
        // Note: this is the MSVC strtok_s; on POSIX use strtok_r instead.
        char* pch = line.data();   // non-const data() requires C++17
        char* nextToken = nullptr;
        pch = strtok_s(pch, "|", &nextToken);
        while (pch != nullptr) {
            a += std::stoi(pch);
            pch = strtok_s(nullptr, "|", &nextToken);
        }
    }

    auto t2 = high_resolution_clock::now();
    auto duration = duration_cast<microseconds>(t2 - t1).count();
    std::cout << duration << "\n";
    std::cout << a << "\n";
}

Vlad Feinstein