0

What's the difference between these two methods of reading an input file?

1) Using 'ifstream.get()'

and

2) Using a vector<char> with ifstreambuf_iterator<char> (less understood by me!)

(other than the obvious answer of having nifty vector methods to work with)

The input file is XML, and as you see below, immediately parsed into a rapidxml document. (initialized elsewhere, see example main function.)

First, let me show you two ways to write the 'load_config' function, one using ifstream.get() and one using vector<char>

Method 1 ifstream.get() provides working code, and a safe rapidXML document object:

rapidxml::xml_document<> *load_config(rapidxml::xml_document<> *doc){
   ifstream myfile("inputfile");

   //read in config file
   char ch;
   char buffer[65536];
   size_t chars_read = 0;

   while(myfile.get(ch) && (chars_read < 65535)){
      buffer[chars_read++] = ch;
   }
   buffer[chars_read++] = '\0';

   cout<<"clearing old doc"<<endl;
   doc->clear();

   doc->parse<0>(buffer);

   //debug returns as expected here
   cout << "load_config: Name of my first node is: " << doc->first_node()->name() << "\n";

   return doc;
}

Method 2 results in a cloberred rapidXML document by another library - specifically, a call to curl_global_init(CURL_GLOBAL_SSL) [see main code below] - but I'm not blaming it on curl_global_init just yet.

rapidxml::xml_document<> *load_config(rapidxml::xml_document<> *doc){
   ifstream myfile("inputfile");

   vector<char> buffer((istreambuf_iterator<char>(inputfile)), 
                istreambuf_iterator<char>( ));
   buffer.push_back('\0');

   cout<<"file looks like:"<<endl;  //looks fine
   cout<<&buffer[0]<<endl;

   cout<<"clearing old doc"<<endl;
   doc->clear();

   doc->parse<0>(&buffer[0]);

   //debug prints as expected
   cout << "load_config: Name of my first node is: " << doc->first_node()->name() << "\n";

   return doc;
}

main code:

int main(void){
   rapidxml::xml_document *doc;
   doc = new rapidxml::xml_document;

   load_config(doc);

   // this works fine:
   cout << "Name of my first node is: " << doc->first_node()->name() << "\n"; 

   curl_global_init(CURL_GLOBAL_SSL);  //Docs say do this first.

   // debug broken object instance:
   // note a trashed 'doc' here if using vector<char> method 
   //  - seems to be because of above line... name is NULL 
   //    and other nodes are now NULL
   //    causing segfaults down stream.
   cout << "Name of my first node is: " << doc->first_node()->name() << "\n"; 

I am pretty darn sure this is all executed in a single thread, but maybe there is something going on beyond my understanding.

I'm also worried that I only fixed a symptom, not a cause... by simply changing my file load function. Looking to the community for help here!

Question: Why would moving away from the vector to a character array fix this?

Hint: I'm aware that rapidXML uses some clever memory management that actually accesses the input string directly.

Hint: The main function above creates a dynamic (new) xml_document. This was not in the original code, and is an artifact of debugging changes. The original (failing) code declared it and did not dynamically allocate it, but identical problems occurred.

Another Hint for full disclosure (although I don't see why it matters) - there is another instance of a vector in this mess of code that is populated by the data in the rapidxml::xml_document object.

FlipMcF
  • 11,666
  • 2
  • 31
  • 43
  • 2
    What is being `sexy` in it? Is this a fashion-show? – Nawaz Jun 15 '11 at 20:17
  • Since the only difference is how the data is read from the file, these questions seem related: http://stackoverflow.com/questions/116038/what-is-the-best-way-to-slurp-a-file-into-a-stdstring-in-c http://stackoverflow.com/questions/195323/what-is-the-most-elegant-way-to-read-a-text-file-with-c – Robᵩ Jun 15 '11 at 20:36
  • 2
    as a sanity check, can you set the debugger to examine memory starting at &buffer[0] for both calls pre and post parse() and see if they are the same in all cases? – ribram Jun 15 '11 at 20:36
  • you still have the bug. When you exit the loop due to `chars_read < 65535` being false, it means that `chars_read == 65535` and then you access the 65535th position, which is one-past the end of the array. – Yakov Galka Jun 15 '11 at 20:43
  • thanks ybungalobill. Shouldn't discount that at all. Noted, appreciated, and fixed. – FlipMcF Jun 15 '11 at 20:52

1 Answers1

5

The only difference between the two is that the vector version works correctly and the char array version causes undefined behavior when the file is longer than 65535 characters (it writes the \0 to the 65535th or 65536th position, which are out-of-bounds).

Another problem that is common to both versions, is that you read the file into a memory that has shorter life-time than the xml_document. Read the documentation:

The string must persist for the lifetime of the document.

When load_config exits the vector is destroyed and the memory is freed. Attempt to access the document cause reading invalid memory (undefined behavior).

In the char array version the memory is allocated on the stack. It is still 'freed' when load_config exists (accessing it causes undefined behavior). But you don't see the crash because it has not yet been overwritten.

Yakov Galka
  • 61,035
  • 13
  • 128
  • 192
  • this is certainly the answer. Eventually saw problems with the char array version also. both the 'static' keyword in function and eventually moving buffer scope to main solved issue. Thank you! – FlipMcF Jun 16 '11 at 14:27