
For my project, I need to read and process a large file containing the energy of seismic receivers. For versatility, it has to handle both .dat and .segy files. My problem is with the .dat files. My current implementation splits the string at the '\t' character, puts the match in a substring, and pushes the value as a float to an std::vector<float>. The substring and tab are then removed from the line, and the next value is searched for. See below:

std::vector<float> parseLine(std::string& number, std::ifstream& file)
{
    getline(file, number); // read the number
    std::vector<float> datalist = selectData(number);

    //for (auto y : datalist) std::cout << y << " ";
    //std::cout << std::endl;
    return datalist;

}


std::vector<float> selectData(std::string& line)
{
    std::vector<float> returnVec;
    //auto parsing_start = std::chrono::high_resolution_clock::now();

    // The question is about this part
    while (true)
    {
        int index = line.find_first_of("\t");
        std::string match = line.substr(0, index);
        if (!line.empty()) {
            returnVec.push_back(std::stof(match));
            line.erase(0, match.length());
        }
        if (line[0] == '\t') line.erase(0,1);
        if (line.empty()) {
            //std::cout << "line is empty" << std::endl; 
            break;
        }

    }   
    return returnVec;
}

Every 100th line, I print the time that has elapsed since the previous 100-line interval. This tells me that the program needs only 1.3 s for the first 100 lines, but this steadily increases to over 40 s for the final 100 lines (see the figure below). Considering my file has 6000 lines of about 4000 data points each, just reading the file takes way too long (about 38 min when I timed it). The lines are all similar in length and composition, and I can't understand why the time increases so much. The lines look like this (the first 2 columns are coordinates):

400 1   200.0   205.1   80.1    44.5
400 2   250.0   209.1   70.1    40.0

but then of course 4000 columns instead of 6.

Here's the main function, as well as how I measure the time and the #includes:

#include <stdio.h>
#include <fstream>
#include <string>
#include <iostream>
#define _SILENCE_EXPERIMENTAL_FILESYSTEM_DEPRECATION_WARNING
#include <experimental/filesystem>
#include <regex>
#include <iterator>
#include <chrono>
#include <Eigen/Dense>
#include "readSeis.h"

MatrixXf extractSeismics(std::string file)
{
    MatrixXf M;

    auto start = std::chrono::high_resolution_clock::now();
    auto interstart = std::chrono::high_resolution_clock::now();
    checkExistence(file);
    std::ifstream myfile(file);
    if (!myfile)
    {
        std::cout << "Could not open file " << file << std::endl;
        exit(1);
    }
    int skipCols = 2; // I don't need the coordinates now
    size_t linecount = 0;
    size_t colcount = 0;
    while (!myfile.eof()) // while not at End Of File (eof)
    {
        std::string number;
        std::vector<float> data = parseLine(number, myfile);
        if (linecount == 0)  colcount = data.size() - skipCols;
        //auto resize_start = std::chrono::high_resolution_clock::now();
        M.conservativeResize(linecount + 1, colcount); // preserves old values :)
        //printElapsedTime(resize_start);
        for (int i = skipCols; i < data.size(); i++)
        {
            M(linecount, i - skipCols) = data[i];

        }
        linecount++;
        // Measure interval time
        if (linecount % 100 == 0)
        {
            std::cout << "Parsing line " << linecount << ", ";
            printElapsedTime(interstart);
            interstart = std::chrono::high_resolution_clock::now();
        }
    }
    myfile.close();
    printElapsedTime(start);
    return M;

}

As a side note, I also tried parsing the line with a regular expression, which resulted in a constant time of 300 ms for every line (about 30 min for the whole file). The splitting method is much faster at the beginning (12 ms per line) but much slower at the end (440 ms per line). The time increase is linear.

Time needed to read 100 lines.

For completeness, the output is here:

testSeis1500_1510_290_832.dat exists, continuing program
Parsing line 100, Execution time : 1204968 Microseconds
Parsing line 200, Execution time : 1971723 Microseconds
Parsing line 300, Execution time : 2727474 Microseconds
Parsing line 400, Execution time : 3640131 Microseconds
Parsing line 500, Execution time : 4392584 Microseconds
Parsing line 600, Execution time : 5150465 Microseconds
Parsing line 700, Execution time : 5944256 Microseconds
Parsing line 800, Execution time : 6680841 Microseconds
Parsing line 900, Execution time : 7456237 Microseconds
Parsing line 1000, Execution time : 8201579 Microseconds
Parsing line 1100, Execution time : 8999075 Microseconds
Parsing line 1200, Execution time : 9860883 Microseconds
Parsing line 1300, Execution time : 10524525 Microseconds
Parsing line 1400, Execution time : 11286452 Microseconds
Parsing line 1500, Execution time : 12134566 Microseconds
Parsing line 1600, Execution time : 12872876 Microseconds
Parsing line 1700, Execution time : 13815265 Microseconds
Parsing line 1800, Execution time : 14528233 Microseconds
Parsing line 1900, Execution time : 15221609 Microseconds
Parsing line 2000, Execution time : 15989419 Microseconds
Parsing line 2100, Execution time : 16850944 Microseconds
Parsing line 2200, Execution time : 17717721 Microseconds
Parsing line 2300, Execution time : 18318276 Microseconds
Parsing line 2400, Execution time : 19286148 Microseconds
Parsing line 2500, Execution time : 19828358 Microseconds
Parsing line 2600, Execution time : 20678683 Microseconds
Parsing line 2700, Execution time : 21648089 Microseconds
Parsing line 2800, Execution time : 22229266 Microseconds
Parsing line 2900, Execution time : 23398151 Microseconds
Parsing line 3000, Execution time : 23915173 Microseconds
Parsing line 3100, Execution time : 24523879 Microseconds
Parsing line 3200, Execution time : 25547811 Microseconds
Parsing line 3300, Execution time : 26087140 Microseconds
Parsing line 3400, Execution time : 26991734 Microseconds
Parsing line 3500, Execution time : 27795577 Microseconds
Parsing line 3600, Execution time : 28367321 Microseconds
Parsing line 3700, Execution time : 29127089 Microseconds
Parsing line 3800, Execution time : 29998775 Microseconds
Parsing line 3900, Execution time : 30788170 Microseconds
Parsing line 4000, Execution time : 31456488 Microseconds
Parsing line 4100, Execution time : 32458102 Microseconds
Parsing line 4200, Execution time : 33345031 Microseconds
Parsing line 4300, Execution time : 33853183 Microseconds
Parsing line 4400, Execution time : 34676522 Microseconds
Parsing line 4500, Execution time : 35593187 Microseconds
Parsing line 4600, Execution time : 37059032 Microseconds
Parsing line 4700, Execution time : 37118954 Microseconds
Parsing line 4800, Execution time : 37824417 Microseconds
Parsing line 4900, Execution time : 38756924 Microseconds
Parsing line 5000, Execution time : 39446184 Microseconds
Parsing line 5100, Execution time : 40194553 Microseconds
Parsing line 5200, Execution time : 41051359 Microseconds
Parsing line 5300, Execution time : 41498345 Microseconds
Parsing line 5400, Execution time : 42524946 Microseconds
Parsing line 5500, Execution time : 43252436 Microseconds
Parsing line 5600, Execution time : 44145627 Microseconds
Parsing line 5700, Execution time : 45081208 Microseconds
Parsing line 5800, Execution time : 46072319 Microseconds
Parsing line 5900, Execution time : 46603417 Microseconds
Execution time : 1442777428 Microseconds

Can someone see why this is happening? It would be much appreciated. :)

Meraj al Maksud
    Sounds like [Schlemiel](https://en.wikipedia.org/wiki/Joel_Spolsky#Schlemiel_the_Painter's_algorithm) is in there somewhere. – Fred Larson Oct 11 '19 at 13:51
    I'd also guess that adding the values to M is the actual culprit here. You should probably have a look with your profiler of choice or just comment out the for loop in extractSeismics – MikeMB Oct 11 '19 at 13:56
    `while (!myfile.eof())`. [See this](https://stackoverflow.com/questions/5605125/why-is-iostreameof-inside-a-loop-condition-i-e-while-stream-eof-cons). Second, a parse routine seems flawed if you're calling `erase()` to perform parsing. Why are you erasing parts of the line? (note what this does -- the entire portion of the string has to be moved "down" to cover up the hole from the erase() call). Last, please post your compiler settings, more to the point, the optimization settings you used when you built your application. – PaulMcKenzie Oct 11 '19 at 13:58
  • @MikeMB It's not the loop that should be commented out for that test, but the line that increases `linecount`. (That is the line that causes the resizes to take more and more time with each iteration.) Of course, that messes with getting the timings, so... well, should not be too hard to adjust? – JaMiT Oct 11 '19 at 14:09
    If I had to guess, I'd suspect `M.conservativeResize` call. I bet it's linear in the current size of the matrix. – Igor Tandetnik Oct 11 '19 at 14:14
  • Change your parameter of the `selectData` function to `const std::string& line`. This will somewhat force you to write the function without erasure of the passed-in line of data. Using `find_first_of` and specifying the starting position instead of erasing parts of the line is what I would expect. – PaulMcKenzie Oct 11 '19 at 14:18
    This shouldn't have anything to do with the string parsing; each line is likely similar size, so while you're doing it inefficiently, it should be *equally* inefficient all through. I'm with @IgorTandetnik that the likely culprit would be something that's *growing* as you go, like the `M.conservativeResize` call; with a name like "conservative", it may be resizing the minimal amount required, rather than overallocating; overallocating by a multiple would be amortized `O(1)` per expansion, minimal resize would be `O(n)` per expansion. – ShadowRanger Oct 11 '19 at 14:25
  • Something very terrible must be happening. If you need 1.2 seconds(!) to parse 100 lines, you already have a fundamental problem. I can't believe that any real-world application should take more than a few ms to read 100 lines, even running on an embedded system like a Raspberry Pi. – Klaus Oct 12 '19 at 08:23
  • Thanks for the useful comments! I have accepted the answer below. The problem of having to resize the matrix (because I do not know the dimensions in advance) is solved by using a standard vector of vectors, as they do not need to be initialized with a size :) Long story short, I've learned a lot about efficiency from this post! – L. van Agtmaal Oct 12 '19 at 19:21

1 Answer


Here's some code to read a file roughly as you described. It reads a line at a time, parses the floats out of each line, skips the first N columns, and puts the rest into a vector<float>. The main function stores each row in a vector<vector<float>>, and (to ensure the rest doesn't get optimized away) adds up all the values it read and prints the total at the end.

#include <iostream>
#include <sstream>
#include <vector>
#include <iterator>
#include <numeric>
#include <fstream>

std::vector<float> selectData(std::string const &line, int skip_count) { 
    std::istringstream buffer(line);

    float ignore;
    for (int i=0; i<skip_count; i++)
        buffer >> ignore;

    return std::vector<float>{std::istream_iterator<float>(buffer), {}};
}

int main(int argc, char **argv) { 
    std::string line;
    float total = 0.0f;

    if (argc != 2) {
        std::cerr << "Usage: accum <infile>\n";
        return EXIT_FAILURE;
    }

    std::vector<std::vector<float>> matrix;

    std::ifstream infile(argv[1]);
    while (std::getline(infile, line)) {
        auto vals = selectData(line, 2);
        matrix.push_back(vals);
        total += std::accumulate(vals.begin(), vals.end(), 0.0f);
    }
    std::cout << "total: " <<total << "\n";
}

On my machine, this reads a file of 6000 lines with 4000 numbers per line in about 22 seconds (but, as they say, your mileage may vary; my machine is fairly old, and on a newer machine doubling that speed wouldn't surprise me at all). This is open to further improvement, but at a guess, reducing the read time from 38 minutes to around 22 seconds is probably enough that further improvement isn't a high priority.

Jerry Coffin
  • Okay, I see how this works. This reduces the execution time from 38 minutes to 73.4 seconds, already much better! I'll see if I can improve more. Probably it'll run faster on my work computer (the 73 seconds is on my laptop with medium specs) – L. van Agtmaal Oct 12 '19 at 20:50