C++, RapidXML: Parse large files

Question

I want to parse a large XML file (33,000 lines). The structure of my XML file is as follows:

<?xml version="1.0" encoding="UTF-8"?><Root_2010 xmlns:noNamespaceSchemaLocation="textpool_1.2.xsd" xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" lang="de-DE">
<Textpool Version="V20.12.08">
<TextpoolList FontFamily="Standard" FontSize="16" FontStyle="normal" FontWeight="bold" SID="S1" TextCharacterLength="0" TextLength="135">
<Text>GlobalCommonTextBook</Text>
</SID_Name>
<TextpoolBlock>
<TextpoolRecord CharacterLengthCheck="Ok" Status="Released" StdTextCharacterLength="4" StdTextLength="???" TID="Txt0_0" TermCheck="NotChecked" TermCheckDescription="NotChecked" TextLengthCheck="Ok" fixed="true">
<IEC translate="no">
<Text/>
</IEC>
<ExplanationText/>
<Text>nein</Text>
</ShortText>
</Description>
<Creator>z0046abb</Creator>
</TextpoolRecord>
</TextpoolBlock>
</TextpoolList>
</Textpool>
</Root_2010>

The element TextpoolList stores two parts. Its name is stored in the first Text element. The TextpoolBlock contains several entries, and the element of interest there is again Text.

I need to parse this file and extract all Text elements from a specific TextpoolList in order to export them into another file. In the future I also want to make use of the attributes of TextpoolList and scan the entries added to ShortText. That's why I want to use an XML parser.

I decided to give RapidXML a chance. Since this file is quite large, I need to move some data from the stack to the heap. Since I don't really know how to do that, I am asking for help. I tried something similar to https://linuxhint.com/parse_xml_in_c__/.

    rapidxml::xml_document<> doc;
    rapidxml::xml_node<>* root_node = NULL;
    rapidxml::xml_node<>* block_node = NULL;
    rapidxml::xml_node<>* record_node = NULL;
    rapidxml::xml_node<>* text_node = NULL;

    std::ifstream infile(file);
    std::string line;
    std::string tp_data;

    while (std::getline(infile, line))
        tp_data += line;

    std::vector<char> tp_data_copy(tp_data.begin(), tp_data.end());

    tp_data_copy.push_back('\0');

    doc.parse<0>(&tp_data_copy[0]);

    root_node = doc.first_node("TextpoolList");

    for (rapidxml::xml_node<>* textpool_node = root_node->first_node("Textpool"); textpool_node; textpool_node = textpool_node->next_sibling())
    {
        for (rapidxml::xml_node<>* list_node = textpool_node->first_node("TextpoolList"); list_node; list_node = list_node->next_sibling())
        {
            for (rapidxml::xml_node<>* block_node = list_node->first_node("TextpoolBlock"); block_node; block_node = block_node->next_sibling())
            {
                for (rapidxml::xml_node<>* record_node = block_node->first_node("TextpoolRecord"); record_node; record_node = record_node->next_sibling())
                {
                    for (rapidxml::xml_node<>* text_node = record_node->first_node("Text"); text_node; text_node = text_node->next_sibling())
                    {
                        std::cout << "record =   " << text_node->value();
                        std::cout << std::endl;
                    }
                    std::cout << std::endl;
                }
            }
        }
    }

Edit: I changed my code so that, as far as I can tell, the data ends up on the heap, but I still get the same error telling me to store data on the heap rather than on the stack.

Thanks for all your ideas!

`This example doesn't work out.` - what exactly didn't work out? — SergeyA, May 03 '21 at 18:33
`root_node->first_node("TextpoolList")` shouldn't that be "TextpoolBlock" ? — acraig5075, May 03 '21 at 18:48
Generally that looks OK, although reading into a `std::string` rather than a `std::vector` is cleaner, and is also heap-based. However, your traversal of the doc doesn't match the XML sample. You don't traverse `TextpoolBlock`s, and you print the `->value()` of `TextpoolRecord`s, but they don't have one - just sub-elements. Did you mean to print the value of their `Text` node instead? — Roddy, May 04 '21 at 13:47
If I try to read it into a `std::string`, something internal to RapidXML fails. Yes, I wanted to print the value of their Text node, but I didn't know how to access it. — doublesobig, May 04 '21 at 16:19
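
On the stack-versus-heap warning from the question: one possible cause, stated here as an assumption because the exact message is not quoted, is the document object rather than the file buffer. As far as I know, `rapidxml::xml_document<>` embeds a fixed-size memory pool (64 KB by default) in the object itself, so a local `doc` occupies that much stack even when the XML text lives in a `std::vector`. The sketch below uses a hypothetical stand-in type, `PooledDocument`, to show the idea of moving such an object to the heap; it is not RapidXML code.

```cpp
#include <cstddef>
#include <memory>

// Hypothetical stand-in for a parser document that embeds a fixed-size
// memory pool inside the object (assumed to mirror what RapidXML does).
struct PooledDocument {
    char pool[64 * 1024];   // large member: lives wherever the object lives
    std::size_t used = 0;   // bytes of the pool handed out so far
};

// Creates the document on the heap. Only the small smart pointer is kept
// on the caller's stack; the 64 KB pool is freed when the pointer dies.
std::unique_ptr<PooledDocument> make_document()
{
    return std::unique_ptr<PooledDocument>(new PooledDocument());
}
```

Under the same assumption, the RapidXML equivalent would be to hold the document as `std::unique_ptr<rapidxml::xml_document<>> doc(new rapidxml::xml_document<>());` and then call `doc->parse<0>(&tp_data_copy[0]);`.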

score 0 · Answer 1 · answered May 04 '21 at 18:30

Okay, things finally work. This is my routine:

    rapidxml::xml_document<> doc;
    rapidxml::xml_node<>* root_node = NULL;
    rapidxml::xml_node<>* block_node = NULL;
    rapidxml::xml_node<>* record_node = NULL;
    rapidxml::xml_node<>* text_node = NULL;
    rapidxml::xml_node<>* list_node = NULL;

    std::ifstream infile(file);
    std::string line;
    std::string tp_data;

    while (std::getline(infile, line))
        tp_data += line;

    std::vector<char> tp_data_copy(tp_data.begin(), tp_data.end());

    tp_data_copy.push_back('\0');

    doc.parse<0>(&tp_data_copy[0]);

    root_node = doc.first_node("Root_2010");

    for (rapidxml::xml_node<>* textpool_node = root_node->first_node("Textpool"); textpool_node; textpool_node = textpool_node->next_sibling())
    {
        for (rapidxml::xml_node<>* list_node = textpool_node->first_node("TextpoolList"); list_node; list_node = list_node->next_sibling())
        {
            for (rapidxml::xml_node<>* block_node = list_node->first_node("TextpoolBlock"); block_node; block_node = block_node->next_sibling())
            {
                for (rapidxml::xml_node<>* record_node = block_node->first_node("TextpoolRecord"); record_node; record_node = record_node->next_sibling())
                {
                    for (rapidxml::xml_node<>* text_node = record_node->first_node("Text"); text_node; text_node = text_node->next_sibling())
                    {
                        std::cout << "record =   " << text_node->value();
                        std::cout << std::endl;
                    }
                    std::cout << std::endl;
                }
            }
        }
    }

If there is some time left, I will try to find a cleaner approach for that messy file reading.

Consider flipping your conditionals to reduce the large amounts of arrow code. — Casey, May 04 '21 at 18:49
A few comments: #1 You don't need the `rapidxml::xml_node<>*` variables at the top, because you (correctly) declare them in the loop scopes. Apart from `root_node`, of course, which you should declare and initialise in one line. `rapidxml::xml_node<>* root_node = doc.first_node(...` — Roddy, May 05 '21 at 06:58
#2: your `next_sibling()` calls don't specify a node name, so you'll fetch the next sibling regardless of name. — Roddy, May 05 '21 at 07:01
#3: Your XML 'sample' is seriously malformed. https://www.xmlvalidation.com/index.php?id=1&L=0 — Roddy, May 05 '21 at 07:01
#4: Your file reading is super-weird, and will remove all newline sequences so multiline `Text` nodes won't work. Use `rapidxml::file`. https://stackoverflow.com/questions/2808022/how-to-parse-an-xml-file-with-rapidxml/14524464 — Roddy, May 05 '21 at 07:08
#5: Does your data format actually need multiple Text nodes under each `Textpool`? if not, you can remove the loop. — Roddy, May 05 '21 at 07:10
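
To illustrate comment #4 without depending on RapidXML, here is a minimal sketch comparing the line-by-line concatenation used in the routine with reading the whole stream into a null-terminated `std::vector<char>`. The helper names `read_by_lines` and `read_whole` are made up for this example.

```cpp
#include <istream>
#include <iterator>
#include <sstream>
#include <string>
#include <vector>

// Hypothetical helper: concatenates lines the way the routine above does.
// std::getline discards each '\n', so line breaks inside text are lost.
std::string read_by_lines(std::istream& in)
{
    std::string line, all;
    while (std::getline(in, line))
        all += line;
    return all;
}

// Hypothetical helper: copies every character of the stream, newlines
// included, into a heap-allocated buffer and appends the terminating '\0'
// that a parser working on a char* expects.
std::vector<char> read_whole(std::istream& in)
{
    std::vector<char> buf((std::istreambuf_iterator<char>(in)),
                          std::istreambuf_iterator<char>());
    buf.push_back('\0');
    return buf;
}
```

For a file, pass a `std::ifstream` to `read_whole` and hand `&buf[0]` to `doc.parse<0>(...)` as in the routine above.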
