I'm reading huge XML files in C++ with rapidxml and trying to optimize the reading, because that part consumes most of the time (I've measured it with std::chrono).

For example, I have an XML file of around 40 MB. The actual parsing by rapidxml takes only about 2300 milliseconds (which is absolutely fine), but copying the file from my std::ifstream into a buffer takes around 30000 milliseconds. I wonder whether the bottleneck is the speed of my HDD or whether there is anything I could do to speed up the buffer copy.

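std::ifstream file(filename);
// A stream cannot be compared to nullptr; test its state instead.
if(!file){
  throw std::runtime_error("File "+filename+" not found!");
}
rapidxml::xml_document<> doc;
// Copy the whole file into memory (this is the slow line).
std::vector<char> buffer((std::istreambuf_iterator<char>(file)), std::istreambuf_iterator<char>());
// rapidxml expects a zero-terminated, mutable buffer.
buffer.push_back('\0');
doc.parse<0>(&buffer[0]);
rapidxml::xml_node<>* root = doc.first_node();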

The problem is the line: std::vector<char> buffer((std::istreambuf_iterator<char>(file)), std::istreambuf_iterator<char>()); which takes about 30 seconds on a 40 MB file.
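
For reference, a minimal sketch of how the buffer copy alone might be timed with std::chrono. The helper name read_with_iterators and its out-parameter are hypothetical and used only for illustration; this is not necessarily how the numbers above were obtained.

```cpp
#include <chrono>
#include <fstream>
#include <iterator>
#include <stdexcept>
#include <string>
#include <vector>

// Hypothetical helper: copies the file through istreambuf_iterator, as in the
// snippet above, and reports how long the copy itself took.
std::vector<char> read_with_iterators(const std::string& filename,
                                      std::chrono::milliseconds& elapsed)
{
    std::ifstream file(filename.c_str());
    if (!file) {
        throw std::runtime_error("File " + filename + " not found!");
    }

    const auto start = std::chrono::steady_clock::now();
    std::vector<char> buffer((std::istreambuf_iterator<char>(file)),
                             std::istreambuf_iterator<char>());
    const auto stop = std::chrono::steady_clock::now();

    elapsed = std::chrono::duration_cast<std::chrono::milliseconds>(stop - start);

    // Terminate the buffer so it can be handed to a parser expecting a C string.
    buffer.push_back('\0');
    return buffer;
}
```

Timing only the copy, separately from doc.parse, keeps the two costs apart when comparing alternatives.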

Any ideas how I could optimize the reading here?

Constantin
  • As a first pass, have you tried using a lower-level API? Specifically, reading from `FILE` using [read](http://linux.die.net/man/2/read) in 1 MB increments comes to mind, since this is most likely what `istream` uses behind the scenes and thus should give you a reasonable lower bound on the performance you may expect. If the difference from `istream` is substantial, then we can start poking at it (replacing its internal buffer with a larger one, for example). – Matthieu M. May 07 '14 at 12:12
  • A second idea might be to use the rapidxml::file class defined in rapidxml_utils.hpp. Does that affect the performance? – Shelling May 07 '14 at 12:16
  • @MatthieuM. On my platform unistd.h is not available (Windows 7, VS 2012), but I could try to read the file with fopen()/fread() instead. I will report back if there is a substantial performance gain with that. – Constantin May 07 '14 at 12:59
  • @Shelling Oh, very good point - I wasn't aware of that class! If I use the constructor `file(const char *filename)`, the reading takes just ~4000 milliseconds - that's a massive performance gain (a compiler warning pops up if I use this constructor, though - `rapidxml_utils.hpp(40): warning C4244: initializing : conversion from std::streamoff to size_t, possible loss of data`). If I call this constructor from multiple threads in parallel (on different files), there is an access violation. The constructor `file(std::basic_istream &stream)` performs almost like my own solution (30 seconds). – Constantin May 07 '14 at 13:14
  • When using rapidxml::file to read a file and later parsing it with rapidxml::xml_document, the rapidxml::file object has to remain accessible while parsing. Once it goes out of scope, its buffer is freed, and trying to parse it gives an access violation. Can you check whether that might be the case? – Shelling May 07 '14 at 13:18
  • @Shelling: you might want to post that as the seed of an answer, and take the opportunity to illustrate how to call those methods "the right way". – Matthieu M. May 07 '14 at 14:21
  • @Matthieu M. I will do so once I have a more stable internet connection. Mobile is no fun on trains. – Shelling May 07 '14 at 14:28
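
The fopen()/fread() idea raised in the comments could be sketched roughly as follows. This is only an illustration under stated assumptions: the helper name read_with_fread and the 1 MB default chunk size are hypothetical, and no timing for this variant was reported in the thread. The file is opened in binary mode ("rb"), so no newline translation takes place, unlike with the text-mode std::ifstream in the question.

```cpp
#include <cstddef>
#include <cstdio>
#include <stdexcept>
#include <string>
#include <vector>

// Hypothetical helper: reads a whole file with fread in fixed-size chunks and
// returns a zero-terminated buffer that could be passed to doc.parse<0>().
std::vector<char> read_with_fread(const std::string& filename,
                                  std::size_t chunk_size = 1 << 20)
{
    std::FILE* fp = std::fopen(filename.c_str(), "rb");
    if (!fp) {
        throw std::runtime_error("File " + filename + " not found!");
    }

    std::vector<char> buffer;
    std::vector<char> chunk(chunk_size);
    std::size_t n = 0;

    // fread returns the number of bytes read; 0 means end of file or error.
    while ((n = std::fread(chunk.data(), 1, chunk.size(), fp)) > 0) {
        buffer.insert(buffer.end(), chunk.begin(), chunk.begin() + n);
    }
    std::fclose(fp);

    buffer.push_back('\0');
    return buffer;
}
```

The returned vector owns the data, so it has to stay alive for as long as the parsed document is used, which is the same lifetime concern mentioned above for rapidxml::file.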

0 Answers