2

I'm getting a strange error with rapidxml when parsing an XML file like this:

<?xml version="1.0" encoding="UTF-8"?>
<IMG align="left"
 src="http://www.w3.org/Icons/WWW/w3c_home" />

It throws "expected >". I'm using code like the following to parse the data:

std::fstream file("./test.xml");
std::istream_iterator<char> eos;
std::istream_iterator<char> iit (file);

std::vector<char> xml(iit, eos);
xml.push_back('\0');

xml_document<> doc;
doc.parse<0>(&xml[0]);

The "/" symbol in the IMG tag seems to be the problem. Is this a rapidxml bug, or am I doing something wrong?

P3trus

3 Answers

2

The way you load the XML data into the vector is wrong. C++ streams have the "skipws" flag set by default, so formatted extraction, which is what std::istream_iterator<char> uses, skips all whitespace in the input. You can verify this by examining the contents of your vector: all spaces and line breaks will be missing. This obviously causes the parser to complain.

Unset the skipws flag on the stream, before creating the iterators, to get the correct behaviour:

file.unsetf(std::ios::skipws);

Alternatively, you can use the file class from rapidxml_utils.hpp to load the file:

#include "rapidxml_utils.hpp"

using namespace rapidxml;
file<> xml_file("test.xml");    // reads the whole file and zero-terminates it
xml_document<> doc;
doc.parse<0>(xml_file.data());

Sadly, loading text files with C++ streams is very tricky and full of traps.

As for sehe's tests in the other answer, the "incorrectly accepted" cases are by design (I don't have enough reputation to add comments to his answer). You need to use the "parse_validate_closing_tags" parse flag to make the parser check whether the end tag name matches the start tag name:

doc.parse<parse_validate_closing_tags>(...);

See parse_validate_closing_tags in the rapidxml manual. The rationale for this behaviour is performance: verifying end tags is time-consuming and in most cases not needed.

kaalus
1

I just tried it out of curiosity. RapidXml might be fast, but it sure isn't very good

#include "rapidxml.hpp"

int main(int argc, char* args[])
{
        if (argc < 2)
                return 1;    // expects the XML text as the first argument

        using namespace rapidxml;
        xml_document<> doc;    // character type defaults to char
        doc.parse<0>(args[1]);    // 0 means default parse flags
}

Invoking it results in all kinds of funny business:

Correctly accepted:

$ ./test.exe "<hello>world</hello>"

$ ./test.exe '<?xml version="1.0" encoding="UTF-8"?> <IMG align="left" src="http://www.w3.org/Icons/WWW/w3c_home" />'

Correctly rejected:

$ ./test.exe '<hello we="" / >'
terminate called after throwing an instance of 'rapidxml::parse_error'
  what():  expected >
Aborted (core dumped)

Incorrectly accepted:

$ ./test.exe '<hello we="close">world</die><zellq></die>'

$ ./test.exe '<hello we="close/">world</die><we horrible=""></don'\''t>'

YMMV

sehe
  • 1
    The "incorrectly accepted" cases are by design. You need to use "parse_validate_closing_tags" parse flag to make the parser check whether end tag name matches starting tag name: doc.parse<parse_validate_closing_tags>(...); See parse_validate_closing_tags in rapidxml manual. The rationale for this behaviour is performance - verifying end tags is time consuming and in most cases not needed. – kaalus Sep 20 '11 at 10:24
  • @kaalus: +1 and thanks for the heads up. I feel that it really is a bit of a mess-up when an XML parser does not parse XML by default, but at least it is good to know that things are not quite as bad as they appeared! – sehe Sep 20 '11 at 13:14
0

Your XML is valid. If the code and the XML are exactly as you posted, it must be a rapidxml bug. I guess it either doesn't support an attribute list split across multiple lines or, less likely, doesn't support /> as the end of a tag.

Yakov Galka