3

I'm writing a program that analyzes CSV files of at least 0.5 GB (and up to over 20 GB). I read from the CSV with fstream, essentially while (getline(fin, line)) {}, and do an average of 17 ms of work on each comma-separated record. Simple stuff.
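
Roughly, the loop looks like this (the file name and process_record are placeholders, not my actual code):

    #include <fstream>
    #include <string>

    // Placeholder for the ~17 ms of work done on each comma-separated record.
    static void process_record(const std::string& record) {
        (void)record; // real parsing/processing goes here
    }

    int main() {
        std::ifstream fin("data.csv"); // placeholder file name
        std::string line;
        while (std::getline(fin, line)) {
            process_record(line);
        }
    }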

But there are a LOT of records. So obviously the program is I/O bound, but I was wondering whether I could improve the I/O performance. I can't resort to OpenMP, as I would run into CPU constraints, and buffering a file this large won't work either. So I might need some kind of pipeline...

I have VERY little experience in multithreading in C++ and have never used dataflow frameworks. Could anyone point me in the right direction?


Update (12/23/14):

Thanks for all your comments. You are right, 17 ms was a bit much... After doing a LOT of profiling (oh, the pain), I isolated the bottleneck to an iteration over a substring in each record (75 chars). I experimented with #pragmas, but it simply isn't enough work to parallelize. The overhead of the function call was the main gripe; it's now 5.41 μs per record after shifting that work into one big block. It's ugly, but faster.

Thanks @ChrisWard1000 for your suggestions. Unfortunately I do not have much control over the hardware I'm using at the moment, but I will profile with larger data sets (>20 GB CSV) and see how I could introduce mmap, multithreaded parsing, etc.

PidgeyBAWK
    Set up a profiling environment to find out where your bottlenecks are. Just guessing, you could probably speed up your processing using a second thread, because 17ms is a looong time for a halfway modern computer, and in that time it's CPU-bound. – Ulrich Eckhardt Dec 23 '14 at 10:08
  • You definitely need to set up a profiling environment. You need to know how long it takes to sequentially read the entire file raw, without intermediate processing. The time to get the data through memory so you can do something else with it is going to be capped at that speed (i.e., you simply can't slurp data faster than your spindles can deliver it). If your current overall time is significantly *larger* than that minimal read time, there is potentially an area for improvement by offloading the read into async-io and dispatching the actual processing to a worker thread or pool. – WhozCraig Dec 23 '14 at 10:19
  • The OS reads ahead for you. Often, a read call is just a memcpy. But it seems you need to prove that you are IO bound first. At 160 records per second it does not seem like it. – usr Dec 23 '14 at 11:20

3 Answers

8

17 ms per record is extremely high; it should not be difficult to improve upon that unless you are using some seriously antiquated hardware.

  1. Upgrade the hardware. SSDs, RAID striping and PCI Express drives are designed for this kind of activity.

  2. Read the file in larger chunks at a time, reducing I/O waiting times. Perhaps use fread to dump large chunks to memory first (see the fread sketch after this list).

  3. Consider using mmap to map the file from disk directly into memory (see the mmap sketch after this list).

  4. Most importantly, profile your code to see where the delays are. This is notoriously difficult with I/O activity because it differs between machines and often varies significantly at runtime.

  5. You could attempt to add multithreaded parsing, however I strongly suggest you try this as a last resort, and understand that it will likely be the cause of a lot of pain and suffering.
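
For point 2, a minimal sketch of what I mean: read fixed-size chunks with fread and carry any partial last line over to the next chunk. The 1 MiB chunk size and the handle_line callback are just illustrative choices, not something your code has to use:

    #include <cstdio>
    #include <string>

    // Illustrative per-record callback; plug in your real processing here.
    static void handle_line(const char* data, std::size_t len) {
        (void)data; (void)len;
    }

    static void read_in_chunks(const char* path) {
        const std::size_t kChunk = 1 << 20;      // 1 MiB per fread; tune for your disk
        std::FILE* f = std::fopen(path, "rb");
        if (!f) return;

        std::string buf(kChunk, '\0');
        std::string carry;                       // incomplete line left over from the previous chunk

        std::size_t n;
        while ((n = std::fread(&buf[0], 1, kChunk, f)) > 0) {
            std::size_t start = 0;
            for (std::size_t i = 0; i < n; ++i) {
                if (buf[i] != '\n') continue;
                if (!carry.empty()) {
                    carry.append(&buf[start], i - start);
                    handle_line(carry.data(), carry.size());
                    carry.clear();
                } else {
                    handle_line(&buf[start], i - start);
                }
                start = i + 1;
            }
            carry.append(&buf[start], n - start); // keep the unfinished tail
        }
        if (!carry.empty()) handle_line(carry.data(), carry.size());
        std::fclose(f);
    }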
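
For point 3, a sketch using POSIX mmap so the kernel pages the file in as you scan it (POSIX-only, error handling kept minimal; on_line is again just an illustrative callback):

    #include <fcntl.h>
    #include <sys/mman.h>
    #include <sys/stat.h>
    #include <unistd.h>
    #include <cstring>

    // Walk every line of the mapped file and hand it to `on_line`.
    static void scan_mapped_file(const char* path,
                                 void (*on_line)(const char*, std::size_t)) {
        int fd = open(path, O_RDONLY);
        if (fd < 0) return;

        struct stat st;
        if (fstat(fd, &st) != 0 || st.st_size == 0) { close(fd); return; }

        void* mapped = mmap(nullptr, st.st_size, PROT_READ, MAP_PRIVATE, fd, 0);
        if (mapped == MAP_FAILED) { close(fd); return; }
        madvise(mapped, st.st_size, MADV_SEQUENTIAL); // hint: we scan front to back

        const char* p = static_cast<const char*>(mapped);
        const char* end = p + st.st_size;
        while (p < end) {
            const char* nl = static_cast<const char*>(
                std::memchr(p, '\n', static_cast<std::size_t>(end - p)));
            std::size_t len = nl ? static_cast<std::size_t>(nl - p)
                                 : static_cast<std::size_t>(end - p);
            on_line(p, len);
            p += len + 1;
        }

        munmap(mapped, st.st_size);
        close(fd);
    }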

ChrisWard1000
  • Thank you for your many suggestions! See the update in the OP; I have dramatically reduced the 17ms and now I simply need to experiment with larger data sets to check true I/O performance. – PidgeyBAWK Dec 23 '14 at 13:19
0

getline probably introduces some CPU overhead that may hurt your performance, but ultimately, if you exhaust the reading speed of your HDD, no pipeline, multithreading, or anything else will help you. Only increasing the I/O bandwidth will help you then, and that's a hardware issue (e.g. put it on a RAID0, collect parts from the network instead, etc.).
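
An easy way to check whether you really are at that limit is to time a raw sequential read with no parsing at all and compare it against your full run. A rough sketch; the chunk size and path are arbitrary, and note that the OS page cache can flatter the numbers on a second run:

    #include <chrono>
    #include <cstdio>
    #include <iostream>
    #include <vector>

    int main(int argc, char** argv) {
        const char* path = argc > 1 ? argv[1] : "data.csv";   // example path
        std::FILE* f = std::fopen(path, "rb");
        if (!f) { std::perror("fopen"); return 1; }

        std::vector<char> buf(1 << 20);                       // 1 MiB reads
        std::size_t total = 0, n = 0;

        auto t0 = std::chrono::steady_clock::now();
        while ((n = std::fread(buf.data(), 1, buf.size(), f)) > 0)
            total += n;                                       // read only, no parsing
        auto t1 = std::chrono::steady_clock::now();
        std::fclose(f);

        double secs = std::chrono::duration<double>(t1 - t0).count();
        double mib = total / (1024.0 * 1024.0);
        std::cout << mib << " MiB in " << secs << " s  ("
                  << mib / secs << " MiB/s)\n";
    }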

Sebastian Redl
0

The trouble with most profilers, as you've found out, is that they either 1) ignore your I/O, or 2) only give you function-level timing, not line-level.

A very simple method gives you both, shown here.

Your program should be I/O bound, meaning if you pause it 10 times, nearly every time you will see it deep in the process of getting the next record.

If you are only processing 160 records per second, you are not I/O bound; you are CPU bound, and nearly every pause will be pointing into your parsing or whatever. For example, you might be newing (and later deleting) lots of objects. If so, re-use them.
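
For instance, hoisting per-record temporaries out of the loop so their storage is reused often removes a lot of that churn. A rough sketch; the comma-splitting code is illustrative, not your parser:

    #include <fstream>
    #include <string>
    #include <vector>

    int main() {
        std::ifstream fin("data.csv");       // example path
        std::string line;
        std::vector<std::string> fields;     // declared once, reused every record

        while (std::getline(fin, line)) {
            fields.clear();                  // keeps the vector's capacity from previous records
            std::size_t start = 0;
            for (;;) {
                std::size_t comma = line.find(',', start);
                if (comma == std::string::npos) {
                    fields.emplace_back(line, start);        // last field
                    break;
                }
                fields.emplace_back(line, start, comma - start);
                start = comma + 1;
            }
            // ... process `fields` ...
        }
    }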

Whatever it is pointing at, find a way to reduce or eliminate that activity. That will speed you up.

Rinse and repeat. When you're I/O bound you can stop.

Mike Dunlavey