
I have a pipe-delimited data file with more than 13 columns and a total size above 100 MB. I read each row and split the string into a `std::vector<std::string>` so I can do calculations on the fields. I repeat this for every row in the file, like below:

    string filename = "file.dat";
    fstream infile(filename);
    string line;
    while (getline(infile, line)) {
        string item;
        stringstream ss(line);
        vector<string> splittedString;
        while (getline(ss, item, '|')) {
            splittedString.push_back(item);
        }
        int a = stoi(splittedString[0]); 
        // I do some processing like this before some manipulation and calculations with the data
    }

This is, however, very time-consuming, and I am pretty sure it is not the most efficient way of reading a CSV-type file. How can this be improved?

Update

I tried using `boost::split` instead of the inner `while` loop, but it was actually even slower.

user7331538
  • You could always use a library that is dedicated to reading CSVs – KaiJ Jul 17 '19 at 09:12
  • @KaiJ do you have any suggestions? – user7331538 Jul 17 '19 at 09:14
  • Possible duplicate of [How can I read and parse CSV files in C++?](https://stackoverflow.com/questions/1120140/how-can-i-read-and-parse-csv-files-in-c). Additionally [here](https://softwarerecs.stackexchange.com/a/47525) is a list of C++/CSV libraries. – user1810087 Jul 17 '19 at 09:15
  • 1
    For start move the `std::vector` declaration out of the while loop and clear inside the loop instead. Or if the column count is always the same you could use an `std::array` (also declared outside the loop). Also like already suggested using a library for csv is the best option. Why reinvent the wheel? – GSIO01 Jul 17 '19 at 09:18
  • 1
    What does "very time consuming" mean here? What is the performance you need to achieve? Are the rows independent, so that processing could be done in parallel? – Karsten Koop Jul 17 '19 at 09:22
  • @KarstenKoop Parallelism is not an option. Since I am quite rusty on my C++, I was hoping to get some experienced tips on the go-to approach for fast CSV parsing. – user7331538 Jul 17 '19 at 09:26
  • 1
    I can't think of a better way, I personally would code it the same way as yours. – ShockCoding Jul 17 '19 at 09:31
  • @ShockCoding I am pretty sure there is a faster way. My approach above is even slower than a Java implementation with `BufferedReader`. – user7331538 Jul 17 '19 at 09:32
  • One thing: if you already know how many values are in each row, you could use an array instead of a vector, which is slightly faster. – ShockCoding Jul 17 '19 at 09:32
  • 2
    CSV is so trivial to read that reading it is limited only by speed of media even with fastest of SSDs. So your issue is likely in creating dynamic objects in tight loop, other processing like that and that it all is measured in debug build. – Öö Tiib Jul 17 '19 at 09:36

2 Answers


You don't actually have a CSV file: CSV stands for Comma-Separated Values, and your file is pipe-delimited, i.e. a plain delimited text file. That's good news, because parsing real CSV (with quoting and escaping) is more complicated than simply splitting on a delimiter.

Anyway, without too many dramatic changes to your approach, here are a few suggestions:

  • Use (more) buffering on the input stream.
  • Move the `vector` out of the loop and `clear()` it on every iteration. That saves heap reallocations, since the vector keeps its capacity.
  • Use `string::find()` instead of `stringstream` to split the string.

Something like this...

#include <fstream>
#include <string>
#include <vector>

using namespace std;

int main() {
    string filename = "file.dat";
    ifstream infile;
    // Give the stream a larger buffer to reduce the number of read() calls.
    // pubsetbuf() must be called before open() to be effective.
    char buffer[65536];
    infile.rdbuf()->pubsetbuf(buffer, sizeof(buffer));
    infile.open(filename);
    string line;
    vector<string> splittedString;   // declared once, reused every iteration
    while (getline(infile, line)) {
        splittedString.clear();      // keeps the capacity, avoids reallocations
        size_t last = 0, pos = 0;
        while ((pos = line.find('|', last)) != string::npos) {
            splittedString.emplace_back(line, last, pos - last);
            last = pos + 1;
        }
        if (!line.empty())
            splittedString.emplace_back(line, last);  // don't drop the trailing field
        int a = stoi(splittedString[0]);
        // I do some processing like this before some manipulation and calculations with the data
    }
}
rustyx

You can save another ~50% by eliminating the `vector<string> splittedString;` entirely and parsing in place with `strtok_s()`:

#include <chrono>
#include <fstream>
#include <iostream>
#include <string>

using namespace std;
using namespace std::chrono;

int main() {
    auto t1 = high_resolution_clock::now();
    long long a(0);

    string filename = "file.txt";
    ifstream infile;
    char buffer[65536];
    infile.rdbuf()->pubsetbuf(buffer, sizeof(buffer));  // before open()
    infile.open(filename);
    string line;
    while (getline(infile, line)) {
        // In-place tokenization: strtok_s overwrites each '|' with '\0'.
        // Note: this is the MSVC strtok_s; on POSIX use strtok_r instead.
        char* pch = line.data();   // non-const data() requires C++17
        char* nextToken = nullptr;
        pch = strtok_s(pch, "|", &nextToken);
        while (pch != nullptr) {
            a += std::stoi(pch);
            pch = strtok_s(nullptr, "|", &nextToken);
        }
    }

    auto t2 = high_resolution_clock::now();
    auto duration = duration_cast<microseconds>(t2 - t1).count();
    std::cout << duration << "\n";
    std::cout << a << "\n";
}

Vlad Feinstein