For my project, I need to read and process a large file containing the energy recorded by seismic receivers. For versatility, it has to handle both .dat and .segy files. My problem is with the .dat files. My current implementation splits the line at the '\t' character, copies the match into a substring, and pushes the value as a float onto an std::vector<float>. The substring and the tab are then erased from the line, and the search repeats for the next value. See below:
std::vector<float> parseLine(std::string& number, std::ifstream& file)
{
    getline(file, number); // read the line
    std::vector<float> datalist = selectData(number);
    //for (auto y : datalist) std::cout << y << " ";
    //std::cout << std::endl;
    return datalist;
}

std::vector<float> selectData(std::string& line)
{
    std::vector<float> returnVec;
    //auto parsing_start = std::chrono::high_resolution_clock::now();
    // The question is about this part
    while (true)
    {
        int index = line.find_first_of("\t");
        std::string match = line.substr(0, index);
        if (!line.empty()) {
            returnVec.push_back(std::stof(match));
            line.erase(0, match.length());
        }
        if (line[0] == '\t') line.erase(0, 1);
        if (line.empty()) {
            //std::cout << "line is empty" << std::endl;
            break;
        }
    }
    return returnVec;
}
Every 100th line, I print the time that has elapsed since the previous 100-line interval. This tells me that the program needs only about 1.2 s for the first 100 lines, but that figure steadily climbs to over 40 s for the final 100 lines (see the figure below). Considering the file has 6000 lines of about 4000 data points each, just reading it takes far too long (about 24 minutes in the timed run shown below). The lines are all similar in length and composition, and I can't understand why the time increases so much. They look like this (the first 2 columns are coordinates):
400 1 200.0 205.1 80.1 44.5
400 2 250.0 209.1 70.1 40.0
but of course with 4000 columns instead of 6.
Here's the main function, as well as how I measure the time and the #includes:
#include <stdio.h>
#include <fstream>
#include <string>
#include <iostream>
#define _SILENCE_EXPERIMENTAL_FILESYSTEM_DEPRECATION_WARNING
#include <experimental/filesystem>
#include <regex>
#include <iterator>
#include <chrono>
#include <Eigen/Dense>
#include "readSeis.h"
MatrixXf extractSeismics(std::string file)
{
    MatrixXf M;
    auto start = std::chrono::high_resolution_clock::now();
    auto interstart = std::chrono::high_resolution_clock::now();
    checkExistence(file);
    std::ifstream myfile(file);
    if (!myfile)
    {
        std::cout << "Could not open file " << file << std::endl;
        exit(1);
    }
    int skipCols = 2; // I don't need the coordinates now
    size_t linecount = 0;
    size_t colcount = 0;
    while (!myfile.eof()) // while not at End Of File (eof)
    {
        std::string number;
        std::vector<float> data = parseLine(number, myfile);
        if (linecount == 0) colcount = data.size() - skipCols;
        //auto resize_start = std::chrono::high_resolution_clock::now();
        M.conservativeResize(linecount + 1, colcount); // preserves old values :)
        //printElapsedTime(resize_start);
        for (int i = skipCols; i < data.size(); i++)
        {
            M(linecount, i - skipCols) = data[i];
        }
        linecount++;
        // Measure interval time
        if (linecount % 100 == 0)
        {
            std::cout << "Parsing line " << linecount << ", ";
            printElapsedTime(interstart);
            interstart = std::chrono::high_resolution_clock::now();
        }
    }
    myfile.close();
    printElapsedTime(start);
    return M;
}
As a side note, I also tried parsing the lines with a regular expression, which gave a constant time of about 300 ms per line (30 min for the whole file). The splitting method is much faster at the beginning (12 ms per line) but much slower at the end (440 ms per line), and the increase is linear.
For completeness, the output is here:
testSeis1500_1510_290_832.dat exists, continuing program
Parsing line 100, Execution time : 1204968 Microseconds
Parsing line 200, Execution time : 1971723 Microseconds
Parsing line 300, Execution time : 2727474 Microseconds
Parsing line 400, Execution time : 3640131 Microseconds
Parsing line 500, Execution time : 4392584 Microseconds
Parsing line 600, Execution time : 5150465 Microseconds
Parsing line 700, Execution time : 5944256 Microseconds
Parsing line 800, Execution time : 6680841 Microseconds
Parsing line 900, Execution time : 7456237 Microseconds
Parsing line 1000, Execution time : 8201579 Microseconds
Parsing line 1100, Execution time : 8999075 Microseconds
Parsing line 1200, Execution time : 9860883 Microseconds
Parsing line 1300, Execution time : 10524525 Microseconds
Parsing line 1400, Execution time : 11286452 Microseconds
Parsing line 1500, Execution time : 12134566 Microseconds
Parsing line 1600, Execution time : 12872876 Microseconds
Parsing line 1700, Execution time : 13815265 Microseconds
Parsing line 1800, Execution time : 14528233 Microseconds
Parsing line 1900, Execution time : 15221609 Microseconds
Parsing line 2000, Execution time : 15989419 Microseconds
Parsing line 2100, Execution time : 16850944 Microseconds
Parsing line 2200, Execution time : 17717721 Microseconds
Parsing line 2300, Execution time : 18318276 Microseconds
Parsing line 2400, Execution time : 19286148 Microseconds
Parsing line 2500, Execution time : 19828358 Microseconds
Parsing line 2600, Execution time : 20678683 Microseconds
Parsing line 2700, Execution time : 21648089 Microseconds
Parsing line 2800, Execution time : 22229266 Microseconds
Parsing line 2900, Execution time : 23398151 Microseconds
Parsing line 3000, Execution time : 23915173 Microseconds
Parsing line 3100, Execution time : 24523879 Microseconds
Parsing line 3200, Execution time : 25547811 Microseconds
Parsing line 3300, Execution time : 26087140 Microseconds
Parsing line 3400, Execution time : 26991734 Microseconds
Parsing line 3500, Execution time : 27795577 Microseconds
Parsing line 3600, Execution time : 28367321 Microseconds
Parsing line 3700, Execution time : 29127089 Microseconds
Parsing line 3800, Execution time : 29998775 Microseconds
Parsing line 3900, Execution time : 30788170 Microseconds
Parsing line 4000, Execution time : 31456488 Microseconds
Parsing line 4100, Execution time : 32458102 Microseconds
Parsing line 4200, Execution time : 33345031 Microseconds
Parsing line 4300, Execution time : 33853183 Microseconds
Parsing line 4400, Execution time : 34676522 Microseconds
Parsing line 4500, Execution time : 35593187 Microseconds
Parsing line 4600, Execution time : 37059032 Microseconds
Parsing line 4700, Execution time : 37118954 Microseconds
Parsing line 4800, Execution time : 37824417 Microseconds
Parsing line 4900, Execution time : 38756924 Microseconds
Parsing line 5000, Execution time : 39446184 Microseconds
Parsing line 5100, Execution time : 40194553 Microseconds
Parsing line 5200, Execution time : 41051359 Microseconds
Parsing line 5300, Execution time : 41498345 Microseconds
Parsing line 5400, Execution time : 42524946 Microseconds
Parsing line 5500, Execution time : 43252436 Microseconds
Parsing line 5600, Execution time : 44145627 Microseconds
Parsing line 5700, Execution time : 45081208 Microseconds
Parsing line 5800, Execution time : 46072319 Microseconds
Parsing line 5900, Execution time : 46603417 Microseconds
Execution time : 1442777428 Microseconds
Can someone see why this is happening? It would be much appreciated. :)