I'm writing a program that analyzes CSV files of at least 0.5 GB (and up to more than 20 GB). I read from the CSV with fstream:

```cpp
while (getline(fin, line)) {}
```

and do an average of 17 ms of work on each comma-separated record. Simple stuff.
But there are a LOT of records, so obviously the program is I/O bound; still, I was wondering whether I could improve the I/O performance. I can't resort to OpenMP, as I'd run into CPU constraints, and buffering a file this large in memory won't work either. So I might need some kind of pipeline, roughly like the sketch below...
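To make the idea concrete, here is the kind of two-stage pipeline I have in mind: one thread reads lines while another does the per-record work, with a bounded queue between them so memory stays flat even on a 20 GB file. This is only a sketch; `process`, `kCapacity`, and the other names are placeholders I made up, not my real code:

```cpp
#include <condition_variable>
#include <cstddef>
#include <fstream>
#include <mutex>
#include <queue>
#include <string>
#include <thread>

// Hypothetical stand-in for the existing per-record work.
void process(const std::string& record) { (void)record; }

std::queue<std::string> q;
std::mutex m;
std::condition_variable not_empty, not_full;
bool done = false;
const std::size_t kCapacity = 4096;  // a guess; tune against memory use

// Producer: reads lines and hands them to the worker.
void reader(const char* path) {
    std::ifstream fin(path);
    std::string line;
    while (std::getline(fin, line)) {
        std::unique_lock<std::mutex> lock(m);
        not_full.wait(lock, [] { return q.size() < kCapacity; });
        q.push(std::move(line));
        not_empty.notify_one();
    }
    std::lock_guard<std::mutex> lock(m);
    done = true;
    not_empty.notify_one();
}

// Consumer: pops one record at a time and does the real work
// outside the lock, so reading and parsing overlap.
void worker() {
    for (;;) {
        std::unique_lock<std::mutex> lock(m);
        not_empty.wait(lock, [] { return !q.empty() || done; });
        if (q.empty()) return;  // reader finished and queue drained
        std::string line = std::move(q.front());
        q.pop();
        not_full.notify_one();
        lock.unlock();
        process(line);
    }
}

int main(int argc, char** argv) {
    if (argc < 2) return 1;
    std::thread r(reader, argv[1]);
    std::thread w(worker);
    r.join();
    w.join();
}
```

The bounded queue matters here: without the `not_full` wait, the reader could outrun the worker and end up buffering most of the file in memory.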
I have VERY little experience in multithreading in C++ and have never used dataflow frameworks. Could anyone point me in the right direction?
Update (12/23/14):
Thanks for all your comments. You are right: 17 ms was a bit much... After doing a LOT of profiling (oh, the pain), I isolated the bottleneck as an iteration over a 75-character substring in each record. I experimented with #pragmas, but there simply isn't enough work there to parallelize; the overhead of the function call was the main gripe. I'm now at 5.41 μs per record after shifting a big block of code inline, roughly as sketched below. It's ugly, but faster.
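To illustrate the change (the real field and logic differ; these helpers are made up for the example), it was essentially a move from a tiny helper called per character to one big inlined loop over the raw buffer:

```cpp
#include <cstddef>
#include <cstdio>
#include <string>

// Before (hypothetical): a small helper called once per character of
// the 75-char substring -- the call overhead dominated the loop body.
bool is_flagged(char c) { return c == '!' || c == '#'; }

int count_flags_slow(const std::string& field) {
    int n = 0;
    for (std::size_t i = 0; i < field.size(); ++i)
        if (is_flagged(field[i])) ++n;  // one call per character
    return n;
}

// After (hypothetical): the same work shifted into one block over
// the raw pointer -- uglier, but no per-character call.
int count_flags_fast(const char* p, std::size_t len) {
    int n = 0;
    for (const char* end = p + len; p != end; ++p)
        n += (*p == '!') | (*p == '#');
    return n;
}

int main() {
    std::string field = "a!b#c!d";
    std::printf("%d %d\n", count_flags_slow(field),
                count_flags_fast(field.data(), field.size()));
}
```

An optimizing compiler may well inline a helper this small on its own, so a change like this is worth verifying in the profiler before committing to the ugly version.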
Thanks @ChrisWard1000 for your suggestions. Unfortunately I do not have much control over the hardware I'm using at the moment, but I will profile with larger data sets (>20 GB CSV) and see how I could introduce mmap, multithreaded parsing, etc., perhaps along the lines of the sketch below.
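For reference, the mmap route I'd be evaluating would look something like this on POSIX systems (a sketch only, with error handling trimmed):

```cpp
#include <fcntl.h>
#include <sys/mman.h>
#include <sys/stat.h>
#include <unistd.h>
#include <cstring>

int main(int argc, char** argv) {
    if (argc < 2) return 1;
    int fd = open(argv[1], O_RDONLY);
    if (fd < 0) return 1;
    struct stat st;
    if (fstat(fd, &st) != 0 || st.st_size == 0) return 1;

    // Map the whole file; the OS pages it in on demand, so this does
    // not need 20 GB of RAM up front.
    void* map = mmap(nullptr, st.st_size, PROT_READ, MAP_PRIVATE, fd, 0);
    if (map == MAP_FAILED) return 1;
    madvise(map, st.st_size, MADV_SEQUENTIAL);  // hint: linear scan

    const char* p = static_cast<const char*>(map);
    const char* end = p + st.st_size;
    while (p < end) {
        const char* nl = static_cast<const char*>(
            std::memchr(p, '\n', static_cast<std::size_t>(end - p)));
        if (!nl) nl = end;
        // parse the record in [p, nl) here, without copying
        p = nl + 1;
    }

    munmap(map, st.st_size);
    close(fd);
    return 0;
}
```

Multithreaded parsing could then mean giving each worker thread its own byte range of the mapping, with the ranges split at newline boundaries, but I'd want to profile before going down that road.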