
Consider the following code:

std::vector<int> indices = /* Non overlapping ranges. */;
std::istream& in = /*...*/;

for (std::size_t i = 0; i < indices.size() - 1; ++i)
{
    in.seekg(indices[i]);

    std::vector<int> data(indices[i+1] - indices[i]);

    in.read(reinterpret_cast<char*>(data.data()), data.size()*sizeof(int));

    process_data(data);
}

I would like to make this code parallel and as fast as possible.

One way of parallelizing it using PPL would be:

std::vector<int> indices = /* Non overlapping ranges. */;
std::istream& in = /*...*/;
std::vector<concurrency::task<void>> tasks;    

for (std::size_t i = 0; i < indices.size() - 1; ++i)
{
    in.seekg(indices[i]);

    std::vector<int> data(indices[i+1] - indices[i]);

    in.read(reinterpret_cast<char*>(data.data()), data.size()*sizeof(int));

    tasks.emplace_back(std::bind(&process_data, std::move(data)));
}
concurrency::when_all(tasks.begin(), tasks.end()).wait();

The problem with this approach is that I want to process the data (which fits into the CPU cache) on the same thread that read it into memory, while the data is still hot in the cache. That is not the case here, so the opportunity to use hot data is simply wasted.

I have two ideas for how to improve this; however, I have not been able to realize either.

  1. Start the next iteration on a separate task (see the first sketch after this list).

    /* ??? */
    {
         in.seekg(indices[i]);
    
         std::vector<int> data(indices[i+1] - indices[i]); // data size will fit into CPU cache.
    
         in.read(reinterpret_cast<char*>(data.data()), data.size()*sizeof(int));
    
         /* Start a task that begins the next iteration? */
    
         process_data(data);
    }
    
  2. Use memory-mapped files: map the required region of the file and, instead of seeking, just read from the pointer at the correct offset, processing the data ranges with a parallel_for_each (see the second sketch after this list). However, I don't understand the performance implications of memory-mapped files in terms of when the data is read into memory, and the cache considerations. Maybe I don't even have to consider the cache, since the file is simply DMA'd to system memory and never passes through the CPU?
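
For idea 1, this is roughly the shape I have in mind (untested sketch: read_all is just a name I made up, process_data is the same function as above, and I am assuming that indices hold byte offsets into the stream):

#include <ppl.h>
#include <cstddef>
#include <functional>
#include <istream>
#include <vector>

void process_data(std::vector<int>& data); // defined elsewhere, as above

void read_all(std::istream& in, const std::vector<int>& indices)
{
    concurrency::task_group tasks;

    // Each step reads one chunk, schedules the read of the next chunk on a
    // separate task, and only then processes its own chunk - on the thread
    // that just read it, while the data is still hot in cache.
    std::function<void(std::size_t)> step = [&](std::size_t i)
    {
        if (i + 1 >= indices.size())
            return;

        in.seekg(indices[i]);

        // indices are assumed to be byte offsets, hence the division.
        std::vector<int> data((indices[i + 1] - indices[i]) / sizeof(int));
        in.read(reinterpret_cast<char*>(data.data()), data.size() * sizeof(int));

        // Only one task touches the stream at a time, since each task
        // finishes its read before scheduling the next one.
        tasks.run([&, i] { step(i + 1); });

        process_data(data); // same thread as the read
    };

    step(0);
    tasks.wait();
}

The reads stay serialized, but the read of chunk i + 1 overlaps with the processing of chunk i, which is all the overlap a single stream allows anyway.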
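
For idea 2, this is roughly what I picture (again an untested sketch, Windows-only since I am using PPL anyway; error handling omitted, file_name is made up, indices assumed to be byte offsets):

#include <ppl.h>
#include <windows.h>
#include <cstddef>
#include <numeric>
#include <vector>

void process_data(std::vector<int>& data); // defined elsewhere, as above

void process_file(const wchar_t* file_name, const std::vector<int>& indices)
{
    HANDLE file = CreateFileW(file_name, GENERIC_READ, FILE_SHARE_READ,
                              nullptr, OPEN_EXISTING, FILE_ATTRIBUTE_NORMAL, nullptr);
    HANDLE mapping = CreateFileMappingW(file, nullptr, PAGE_READONLY, 0, 0, nullptr);
    const char* base = static_cast<const char*>(MapViewOfFile(mapping, FILE_MAP_READ, 0, 0, 0));

    std::vector<std::size_t> chunks(indices.size() - 1);
    std::iota(chunks.begin(), chunks.end(), std::size_t(0));

    concurrency::parallel_for_each(chunks.begin(), chunks.end(), [&](std::size_t i)
    {
        // The page faults that pull this range into memory and the processing
        // of the range both happen on the same worker thread.
        const int* first = reinterpret_cast<const int*>(base + indices[i]);
        std::vector<int> data(first, first + (indices[i + 1] - indices[i]) / sizeof(int));
        process_data(data);
    });

    UnmapViewOfFile(base);
    CloseHandle(mapping);
    CloseHandle(file);
}

But as said, I don't know whether the page-cache behaviour of the mapping makes the hot-cache reasoning moot.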

Any suggestions or comments?

ronag
  • Why don't you try splitting the file into smaller files, then reading and processing them in separate threads? With files you have to think about both CPU-bound and I/O-bound work. – askmish Aug 16 '12 at 14:30
  • +1. It seems like any benefits you would gain from keeping the memory in cache would be dwarfed by the wait for file I/O. – Ben Fulton Aug 17 '12 at 14:37
  • @BenFulton: You are right, under the assumption that the file is not in the OS cache and/or that I am not performing multiple concurrent executions of this function. – ronag Aug 17 '12 at 14:44

1 Answer


It's most likely that you are pursuing the wrong goal. As already noted, any advantage from 'hot' data will be dwarfed by disk speed. Beyond that, there are important details you didn't mention:
1) Is the file 'big'?
2) Is a single record 'big'?
3) Is the processing 'slow'?

If the file is 'big', your biggest priority is ensuring that the file is read sequentially. Your "indices" make me think otherwise. A recent example from my own experience: 6 seconds vs. 20 minutes depending on sequential vs. random reads. No kidding.

If the file is 'small' and you are positive that it is cached entirely, you just need a synchronized queue to deliver work to your threads; then it is no problem to process each chunk on the same thread that read it.
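
Something along these lines (untested sketch; I assume each thread can open its own stream on the file, that a file_name is available, and that your 'indices' are byte offsets):

#include <cstddef>
#include <fstream>
#include <mutex>
#include <queue>
#include <thread>
#include <vector>

void process_data(std::vector<int>& data); // your function

void process_ranges(const char* file_name, const std::vector<int>& indices,
                    unsigned thread_count = std::thread::hardware_concurrency())
{
    std::queue<std::size_t> work;          // queue of chunk numbers
    for (std::size_t i = 0; i + 1 < indices.size(); ++i)
        work.push(i);

    std::mutex m;                          // protects 'work'

    auto worker = [&]
    {
        std::ifstream in(file_name, std::ios::binary); // one stream per thread

        for (;;)
        {
            std::size_t i;
            {
                std::lock_guard<std::mutex> lock(m);
                if (work.empty())
                    return;
                i = work.front();
                work.pop();
            }

            in.seekg(indices[i]);

            std::vector<int> data((indices[i + 1] - indices[i]) / sizeof(int));
            in.read(reinterpret_cast<char*>(data.data()), data.size() * sizeof(int));

            process_data(data);            // read and processed on the same thread
        }
    };

    std::vector<std::thread> threads(thread_count);
    for (auto& t : threads) t = std::thread(worker);
    for (auto& t : threads) t.join();
}

Since the whole queue is filled up front, a plain mutex-and-pop is enough; no condition variable is needed.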

The other option is to split 'indices' into halves, one for each thread.

Codeguard