
I have to read an 8192x8192 matrix into memory. I want to do it as fast as possible.
Right now I have this structure:

char inputFile[8192][8192*4]; // I know the numbers are at max 3 digits
int8_t matrix[8192][8192];    // matrix to be populated

// Read the entire file line by line using fgets
while (fgets(inputFile[lineNum++], MAXCOLS, fp));

// Populate the matrix in parallel
for (t = 0; t < NUM_THREADS; t++) {
    pthread_create(&threads[t], NULL, ParallelRead, (void *)t);
}

In the function ParallelRead, I parse each line with atoi and populate the matrix. The parallelism is line-wise: thread t parses lines t, t + NUM_THREADS, t + 2*NUM_THREADS, and so on.

On a two-core system with 2 threads, this takes (times in seconds):

Loading big file (fgets) : 5.79126
Preprocessing data (Parallel Read) : 4.44083

Is there a way to optimize this any further?

genpfault
sud03r
  • Perhaps you could start the populating threads in parallel with the I/O, as enough data becomes available. – vanza May 20 '12 at 19:06
  • To be honest, I am a bit surprised you've managed to get *any* performance improvement out of reading the same file from multiple threads... When benchmarking, are you making sure the file is actually read from the disk, and not from the cache? – NPE May 20 '12 at 19:08
  • @aix I have used 2 threads just as an example. I have parallelized the preprocessing part; this is after the data is read into memory. – sud03r May 20 '12 at 19:16
  • The best you can possibly get is less than twice as fast, even if you can make the `Preprocessing` take 0 time. Is it worth it? (See Amdahl's law.) – Alan Stokes May 20 '12 at 19:16
  • @AlanStokes I see no harm in saving 4.44 seconds if I have the resources. I actually was able to reduce the preprocessing time to 0.1 on a machine with 40 cores. – sud03r May 20 '12 at 19:21
  • @vanza You may be right; this was something I was also thinking of. I could let fgets() carry on in one thread and start the preprocessing task in another thread, but that would require my threads to wait if the data is not available, and I've seen performance penalties with that. – sud03r May 20 '12 at 19:25
  • 2
    The only ways I know to improve disk read performance are: 1) read the data from a compressed source. 2) use faster disks, or RAID array. or 3) split the data onto separate disks and read 1 thread per disk. Usually, if a single thread can't keep up with your disk read time, you have big problems. – mfa May 20 '12 at 19:25
  • 1
    Store your data in binary. If each matrix element can take at most 256 different values, we're looking at 64MB here, which should be easily processable by modern hardware. You can then also memory-map the file directly into your program. – Kerrek SB May 20 '12 at 19:44
  • @KerrekSB You mean to say that the input file which I am reading should be binary? I can't modify the format of the input file. – sud03r May 20 '12 at 20:05
  • @sud03r: You could write a conversion tool, though, if this would improve your overall operational efficiency. – Kerrek SB May 20 '12 at 20:17

4 Answers


It's a bad idea to do it this way. Threads can get you more CPU cycles if you have enough cores, but you still have only one hard disk. So inevitably, threads cannot improve the speed of reading the file data.

They actually make it much worse. Reading data from a file is fastest when you access the file sequentially. That minimizes the number of reader head seeks, by far the most expensive operation on a disk drive. By splitting the reading across multiple threads, each reading a different part of the file, you are making the reader head constantly jump back and forth. Very, very bad for throughput.

Use only one thread to read file data. You might be able to overlap it with some computational cycles on the file data by starting a thread once a chunk of the file data is loaded.

Do watch out for the test effect. When you re-run your program, typically after tweaking your code somewhat, it is likely that the program can find file data back in the file system cache so it doesn't have to be read from the disk. That's very fast, memory bus speed, a memory-to-memory copy. Pretty likely on your dataset since it isn't very big and easily fits in the amount of RAM a modern machine has. This does not (typically) happen on a production machine. So be sure to clear out the cache to get realistic numbers, whatever it takes on your OS.

Hans Passant
  • He is _not_ reading the file in parallel; he is converting the strings to `int8_t`s in parallel, from memory. There's nothing wrong with that. – kratenko May 21 '12 at 19:23
  • I never claimed there was anything wrong with that. In fact, I recommended overlapping that with the thread that reads the data. – Hans Passant Jun 21 '13 at 20:30

One thing worth considering is allocating two smaller input buffers (say, 200 lines each).

Then have one thread read data into the input buffers. When one input buffer is full, pass it to a second thread that does the parsing. That second thread could use a thread pool for concurrent parsing (see OpenMP).

You will have to use locks/mutexes to ensure that only one thread touches a buffer at a time.

This would be better because the parsing is now concurrent with reading the file, and your memory access to the buffer is more local and will fit into your CPU cache. This can improve both reading and parsing speed.

If fgets is the bottleneck, you can also read the file into memory as binary. This could improve read speed, but will require you to do extra parsing and will make the abovementioned optimization harder to carry out.


Try a parent thread that loads the character array using something like fread, to load everything in one I/O as one great big string.

Have the parent walk the string and find one line, or calculate where the first line is based on sizes. Hand the processing of that line off to a thread. Next line, rinse, repeat, until EOF. Sync with the threads. Done.

EvilTeach

The best performance you can get with file I/O is via memory mapping. This is an example. I would start from a single-threaded design and, if post-load processing proves to be a bottleneck, make it parallel.

bobah