
I want to read and store a large CSV file into a map. I started by just reading the file and seeing how long it takes to process. This is my loop:

while(!gFile.eof()){
   gFile >> data;
}

It is taking me ~35 mins to process the csv file that contains 35 million lines and six columns. Is there any way to speed this up? Pretty new to SO, so apologies if not asking correctly.

jon v
  • I'm pretty sure there are a few libraries expressly for this purpose. That said, 35 minutes sounds a little too long. How much other processing are you performing? Make sure you enabled optimizations! – tambre Aug 24 '17 at 15:28
  • If you're re-writing the same variable over and over, it's not clear why this takes so long. How long does it take to copy the file? – tadman Aug 24 '17 at 15:29
  • Possible duplicate of [Read whole ASCII file into C++ std::string](https://stackoverflow.com/questions/2602013/read-whole-ascii-file-into-c-stdstring) – smac89 Aug 24 '17 at 15:29
  • Also Possible duplicate of [How can I read and parse CSV files in C++?](https://stackoverflow.com/questions/1120140/how-can-i-read-and-parse-csv-files-in-c?rq=1) – smac89 Aug 24 '17 at 15:30
  • Please read [why is while feof is always wrong](https://stackoverflow.com/questions/5431941/why-is-while-feof-file-always-wrong) – Ed Heal Aug 24 '17 at 15:34
  • @EdHeal great insight! Did not know – jon v Aug 24 '17 at 15:38
  • I'd use a library unless you want to dig into some very low-level, and likely platform-specific, code. Generally I've found that C++ file streams are vastly slower than what's possible, but the fast paths are often platform dependent (e.g. letting the system know you intend to sequentially scan the file, and letting it keep reading the disk while you're parsing some data, in order to reach the disk's potential max read speed). – Fire Lancer Aug 24 '17 at 15:45
  • Read blocks into memory, then parse the text in memory. File input likes to be continuous, not interrupted. One technique is to have one thread read into a buffer and another thread process the data. Search the internet for "double buffering". – Thomas Matthews Aug 24 '17 at 15:50
  • If you have control over the producer of the data, make the data in fixed-sized records (and columns). This will eliminate the need to search for a column delimiter; you can use math to advance to the next column or record. – Thomas Matthews Aug 24 '17 at 15:51
  • Search the internet for "data cache efficient c++" for more information about optimizing your program to make efficient use of the processor's data cache. – Thomas Matthews Aug 24 '17 at 15:52
  • @ThomasMatthews thanks Thomas! – jon v Aug 24 '17 at 15:57
  • @Thomas, I think 1 thread is generally a lot easier, even if you have to step outside `fstream` and standard C++ (or use a wrapper/lib), than multithreading (assuming you're I/O-bound, not compute-bound). e.g. use Win32 async ("overlapped") reads or open the file with `FILE_FLAG_SEQUENTIAL_SCAN` or `POSIX_FADV_SEQUENTIAL`. – Fire Lancer Aug 24 '17 at 16:04
  • @FireLancer: Memory mapped files are easier than multiple threads. You let the OS manage the memory and when to read the file. – Thomas Matthews Aug 24 '17 at 16:15

2 Answers


Background
Files are stream devices. The most efficient way to read a file is to keep the data streaming (flowing). Every transaction has an overhead; the larger the data transfer, the less impact that overhead has. So the goal is to keep the data flowing.

Memory is faster than file access
Searching memory is many times faster than searching a file. So searching memory for a "word" or delimiter is going to be faster than reading the file character by character to find the delimiter.

Method 1: Line by line
Using std::getline is much faster than using operator>>. Although the stream may still read a block of data internally, you perform only one transaction per record versus one transaction per column. Remember: keep the data flowing, and searching memory for the columns is faster.
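For illustration, a minimal sketch of the line-by-line approach, assuming a comma-delimited file; the file name, the choice of the first column as the key, and the map type are assumptions, not part of the question:

```cpp
#include <fstream>
#include <sstream>
#include <string>
#include <unordered_map>
#include <utility>
#include <vector>

int main()
{
    std::ifstream gFile("data.csv");            // file name is an assumption
    std::unordered_map<std::string, std::vector<std::string>> table;

    std::string line;
    while (std::getline(gFile, line))           // one transaction per record
    {
        std::istringstream record(line);        // split the record in memory
        std::string key, field;
        std::getline(record, key, ',');         // first column used as the key (assumption)

        std::vector<std::string> columns;
        while (std::getline(record, field, ','))
            columns.push_back(field);

        table.emplace(std::move(key), std::move(columns));
    }
}
```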

Method 2: Block reading
In the spirit of keeping the data flowing, read a block of the file into a large buffer in memory. Process the data from the buffer. This is more efficient than reading line by line because you can read multiple lines of data with one transaction, reducing the per-transaction overhead.

One caveat is that you may have a record cross buffer boundaries, so you'll need to come up with an algorithm to handle that. The execution penalty is small and only happens once per transaction (consider this part of the overhead of a transaction).
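A sketch of block reading, with the boundary caveat handled by carrying the partial record over to the next block; the 1 MiB buffer size, file name, and processLine stub are illustrative assumptions:

```cpp
#include <cstddef>
#include <fstream>
#include <string>
#include <vector>

void processLine(const std::string& line)        // placeholder for real parsing
{
    // parse the six columns here
}

int main()
{
    std::ifstream gFile("data.csv", std::ios::binary);
    std::vector<char> buffer(1 << 20);           // 1 MiB block, tune to taste
    std::string leftover;                        // partial record from the previous block

    while (gFile.read(buffer.data(), buffer.size()) || gFile.gcount() > 0)
    {
        std::size_t got = static_cast<std::size_t>(gFile.gcount());
        std::string chunk = leftover;
        chunk.append(buffer.data(), got);

        std::size_t start = 0, pos;
        while ((pos = chunk.find('\n', start)) != std::string::npos)
        {
            processLine(chunk.substr(start, pos - start));
            start = pos + 1;
        }
        leftover = chunk.substr(start);          // record crossing the buffer boundary
    }
    if (!leftover.empty())
        processLine(leftover);                   // last record without a trailing newline
}
```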

Method 3: Multiple threads
In the spirit of keeping the data streaming, you could create multiple threads. One thread is in charge of reading the data into a buffer while another thread processes the data from the buffer. This technique will have better luck keeping the data flowing.

Method 4: Double buffering & multiple threads
This takes Method 3 above and adds multiple buffers. The reading thread can fill up one buffer then start filling a second buffer. The data processing thread will wait until the first buffer is filled before processing the data. This technique is used to better match the speed of reading data to the speed of processing the data.
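A compact sketch of Methods 3 and 4 combined: one reader thread, one parser thread, and at most two blocks in flight (double buffering via a bounded queue). The buffer size, file name, and processChunk stub are assumptions, and records crossing block boundaries would still need the Method 2 handling:

```cpp
#include <condition_variable>
#include <cstddef>
#include <fstream>
#include <mutex>
#include <queue>
#include <thread>
#include <vector>

std::mutex m;
std::condition_variable cv;
std::queue<std::vector<char>> filled;            // blocks ready for parsing
bool done = false;

void processChunk(const std::vector<char>& chunk)   // placeholder for real parsing
{
}

void reader()
{
    std::ifstream gFile("data.csv", std::ios::binary);
    for (;;)
    {
        std::vector<char> buf(1 << 20);          // 1 MiB block, an arbitrary choice
        gFile.read(buf.data(), buf.size());
        std::size_t got = static_cast<std::size_t>(gFile.gcount());
        if (got == 0) break;
        buf.resize(got);

        {
            std::unique_lock<std::mutex> lock(m);
            cv.wait(lock, [] { return filled.size() < 2; });  // at most two blocks in flight
            filled.push(std::move(buf));
        }
        cv.notify_all();
    }
    {
        std::lock_guard<std::mutex> lock(m);
        done = true;
    }
    cv.notify_all();
}

void parser()
{
    for (;;)
    {
        std::vector<char> buf;
        {
            std::unique_lock<std::mutex> lock(m);
            cv.wait(lock, [] { return !filled.empty() || done; });
            if (filled.empty()) return;          // reader finished and queue drained
            buf = std::move(filled.front());
            filled.pop();
        }
        cv.notify_all();
        processChunk(buf);                       // parsing overlaps with the next read
    }
}

int main()
{
    std::thread t1(reader), t2(parser);
    t1.join();
    t2.join();
}
```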

Method 5: Memory mapped files
With a memory-mapped file, the operating system handles reading the file into memory on demand. There is less code to write, but you don't get as much control over when the file is read into memory. This is still faster than reading field by field.
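A POSIX-only sketch of the memory-mapped approach (Windows would use CreateFileMapping/MapViewOfFile instead); the file name and the trivial newline count stand in for real parsing:

```cpp
#include <fcntl.h>
#include <sys/mman.h>
#include <sys/stat.h>
#include <unistd.h>

#include <cstddef>
#include <iostream>

int main()
{
    int fd = open("data.csv", O_RDONLY);         // file name is an assumption
    if (fd < 0) return 1;

    struct stat sb;
    if (fstat(fd, &sb) != 0) return 1;

    char* data = static_cast<char*>(
        mmap(nullptr, sb.st_size, PROT_READ, MAP_PRIVATE, fd, 0));
    if (data == MAP_FAILED) return 1;

    // The whole file is now addressable as one big character array;
    // the OS pages it in on demand. Scan it for newlines/commas directly.
    std::size_t lines = 0;
    for (off_t i = 0; i < sb.st_size; ++i)
        if (data[i] == '\n') ++lines;
    std::cout << lines << " records\n";

    munmap(data, sb.st_size);
    close(fd);
}
```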

Thomas Matthews
  • I'd avoid multiple threads unless I'm compute-bound; that's often a bunch of extra unneeded complexity. Wrapping Win32 `CreateFile`+`ReadFile` or *nix `open`+`read` is really not hard, and can read at the full disk speed (even better than memory mapped, because you can tell the OS you're reading sequentially), something C++ does not even guarantee with a "dedicated" thread. For simple CSV etc. I've even kept up with an SSD's potential read speed without threading (several hundred MB a second). – Fire Lancer Aug 24 '17 at 16:08
  • Downvoters: Please add a comment explaining your downvote. I have personally implemented these techniques and found significant performance improvements when processing data files over 1 GB in size. – Thomas Matthews Aug 24 '17 at 16:08

Let's start with the bottlenecks.

  1. Reading from disk
  2. Decoding the data
  3. Store in map
  4. Memory speed
  5. Amount of memory

Read from disk

  • Read till you drop: if you aren't reading fast enough to use all the bandwidth of the disk, you can still go faster. Ignore all other steps and only read.
  • Start by adding buffers to your input stream
  • Set hints for reading (see the sketch after this list)
  • use mmap
  • 4 GB is a trivial size; if you don't already have 32 GB of RAM, upgrade
  • Too slow? Buy an M.2 disk.
  • Still too slow? Then go more exotic: change the disk driver, dump the OS, mirror disks; only your $£€ is the limit.
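As an illustration of the buffer and hint bullets, a sketch under the assumption of a POSIX system; the file name and buffer sizes are arbitrary choices:

```cpp
#include <fcntl.h>      // open, posix_fadvise (POSIX)
#include <unistd.h>     // read, close (POSIX)

#include <fstream>
#include <vector>

int main()
{
    // Option A: enlarge the ifstream's internal buffer (must be set before open).
    std::vector<char> streamBuf(1 << 22);        // 4 MiB, an arbitrary size
    std::ifstream in;
    in.rdbuf()->pubsetbuf(streamBuf.data(), streamBuf.size());
    in.open("data.csv", std::ios::binary);
    // ... read from `in` as usual ...

    // Option B: POSIX read with a sequential-scan hint on the same descriptor
    // the reads go through, so the kernel can read ahead aggressively.
    int fd = open("data.csv", O_RDONLY);
    if (fd >= 0)
    {
        posix_fadvise(fd, 0, 0, POSIX_FADV_SEQUENTIAL);
        std::vector<char> block(1 << 20);        // 1 MiB per read
        ssize_t n;
        while ((n = read(fd, block.data(), block.size())) > 0)
        {
            // parse block.data()[0 .. n) here
        }
        close(fd);
    }
}
```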

Decode the data

  • if your data is in lines that are all the same length, then all decoding can be done in parallel, limited only by memory bandwidth.
  • if the line lengths vary only a little, finding the end of each line can be done in parallel, followed by a parallel decode.
  • if the order of the lines doesn't matter for the final map, just split the file into #hardwarethreads parts and let each thread process its part up to the first newline in the next thread's part (see the sketch after this list).
  • memory bandwidth will most likely be reached far before the CPU is anywhere near used up.
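A sketch of the chunk-splitting idea from the third bullet, assuming the whole file fits in memory (see the "Amount of memory" section below); the file name and the decodeRange stub are placeholders:

```cpp
#include <algorithm>
#include <cstddef>
#include <fstream>
#include <functional>
#include <iterator>
#include <string>
#include <thread>
#include <vector>

void decodeRange(const std::string& text, std::size_t begin, std::size_t end)
{
    // placeholder: parse the lines in text[begin, end)
}

int main()
{
    // Load the whole file into memory first.
    std::ifstream in("data.csv", std::ios::binary);
    std::string text((std::istreambuf_iterator<char>(in)),
                     std::istreambuf_iterator<char>());

    unsigned n = std::max(1u, std::thread::hardware_concurrency());
    std::vector<std::thread> workers;

    std::size_t begin = 0;
    for (unsigned i = 0; i < n; ++i)
    {
        // Nominal end of this thread's slice, then extend to the next newline
        // so no line is split between two threads.
        std::size_t end = (i + 1 == n) ? text.size()
                                       : std::max(begin, text.size() * (i + 1) / n);
        end = text.find('\n', end);
        if (end == std::string::npos) end = text.size(); else ++end;

        workers.emplace_back(decodeRange, std::cref(text), begin, end);
        begin = end;
        if (begin >= text.size()) break;
    }
    for (auto& t : workers) t.join();
}
```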

Store in map

  • hopefully you have thought about this map in advance, as none of the std maps are thread safe.
  • if you don't care about order, a std::array can be used and you can run at full memory bandwidth.
  • let's say you want to use std::unordered_map: there is a problem in that it needs to update its size after each write, so effectively you are limited to 1 thread writing to it.
  • You could use 1 thread at a time to update it while the others precompute the hash of the record.
  • having one thread write has the problem that nearly every write will be a cache miss, severely limiting speed.
  • so if that is not fast enough, roll your own hash_map without a size that must be updated on every write.
  • to ensure thread safety you also need to protect the writes; having one mutex makes you as slow as or slower than the single writer.
  • you could try to make it lock- and wait-free ... if you're not an expert you will get a severe headache instead.
  • if you have selected a bucket design for your hash, then you could make X times the number of writer threads mutexes and use the hash value to select the mutex. The extra mutexes increase the likelihood that two threads won't collide (see the sketch after this list).
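A sketch of the bucket-mutex idea from the last bullet: a hypothetical sharded map where the shard (and its mutex) is picked from the key's hash, so two writers only contend when they hit the same shard. The shard count and the key/value types are assumptions:

```cpp
#include <array>
#include <cstddef>
#include <mutex>
#include <string>
#include <unordered_map>
#include <utility>

class ShardedMap
{
    static constexpr std::size_t kShards = 64;   // e.g. several times the writer-thread count

    struct Shard
    {
        std::mutex mtx;
        std::unordered_map<std::string, std::string> map;
    };
    std::array<Shard, kShards> shards_;

public:
    void insert(const std::string& key, std::string value)
    {
        std::size_t h = std::hash<std::string>{}(key);
        Shard& s = shards_[h % kShards];         // hash picks the shard and its mutex
        std::lock_guard<std::mutex> lock(s.mtx);
        s.map.emplace(key, std::move(value));
    }
};

int main()
{
    ShardedMap table;
    table.insert("key1", "value1");              // safe to call from many threads concurrently
}
```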

Memory speed

  • Each line will be transferred at least 4 times over the memory bus: once from the disk to RAM (at least once more if the driver is not good), once when the data is decoded, once when the map makes a read request, and once more when the map writes.
  • A good setup can save one more memory access if the driver writes to cache, so that decoding will not result in an LLC miss.

Amount of memory

  • you should have enough memory to hold the total file, the data structure and some intermediate data.
  • Check if RAM is cheaper than your programming time.
Surt