Consider the following code:
std::vector<int> indices = /* Non-overlapping ranges (byte offsets). */;
std::istream& in = /* ... */;
for (std::size_t i = 0; i < indices.size() - 1; ++i)
{
    in.seekg(indices[i]);
    std::vector<int> data((indices[i + 1] - indices[i]) / sizeof(int));
    in.read(reinterpret_cast<char*>(data.data()), data.size() * sizeof(int));
    process_data(data);
}
I would like to make this code parallel and as fast as possible.
One way of parallelizing it using PPL would be:
std::vector<int> indices = /* Non-overlapping ranges (byte offsets). */;
std::istream& in = /* ... */;
std::vector<concurrency::task<void>> tasks;
for (std::size_t i = 0; i < indices.size() - 1; ++i)
{
    in.seekg(indices[i]);
    std::vector<int> data((indices[i + 1] - indices[i]) / sizeof(int));
    in.read(reinterpret_cast<char*>(data.data()), data.size() * sizeof(int));
    tasks.emplace_back(std::bind(&process_data, std::move(data)));
}
concurrency::when_all(tasks.begin(), tasks.end()).wait();
The problem with this approach is that I want to process the data (which fits into the CPU cache) on the same thread that read it into memory, while the data is still hot in cache. That is not the case here: the reads all happen on the calling thread and the processing happens on the task threads, so the opportunity to use hot data is simply wasted.
I have two ideas for improving this; however, I have not been able to realize either.

The first idea is to start the next iteration on a separate task:
/* ??? */
{
    in.seekg(indices[i]);
    std::vector<int> data((indices[i + 1] - indices[i]) / sizeof(int)); // The data will fit into the CPU cache.
    in.read(reinterpret_cast<char*>(data.data()), data.size() * sizeof(int));
    /* Start a task that begins the next iteration? */
    process_data(data);
}
The second idea is to use memory-mapped files: map the required region of the file and, instead of seeking, read from the pointer at the correct offset, processing the data ranges with a parallel_for_each. However, I don't understand the performance implications of memory-mapped files in terms of when the data is actually read into memory, and what the cache considerations are. Maybe I don't even have to consider the cache, since the file is simply DMA'd into system memory, never passing through the CPU?
Any suggestions or comments?