I want to parse a large XML file (33,000 lines). The file has the following structure:

<?xml version="1.0" encoding="UTF-8"?><Root_2010 xmlns:noNamespaceSchemaLocation="textpool_1.2.xsd" xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" lang="de-DE">
<Textpool Version="V20.12.08">
<TextpoolList FontFamily="Standard" FontSize="16" FontStyle="normal" FontWeight="bold" SID="S1" TextCharacterLength="0" TextLength="135">
<Text>GlobalCommonTextBook</Text>
</SID_Name>
<TextpoolBlock>
<TextpoolRecord CharacterLengthCheck="Ok" Status="Released" StdTextCharacterLength="4" StdTextLength="???" TID="Txt0_0" TermCheck="NotChecked" TermCheckDescription="NotChecked" TextLengthCheck="Ok" fixed="true">
<IEC translate="no">
<Text/>
</IEC>
<ExplanationText/>
<Text>nein</Text>
</ShortText>
</Description>
<Creator>z0046abb</Creator>
</TextpoolRecord>
</TextpoolBlock>
</TextpoolList>
</Textpool>
</Root_2010>

The TextpoolList element stores two things. Its name is stored in the first Text element, and TextpoolBlock holds several entries. Within each entry, the element of interest is again Text.

I need to parse this file and extract all Text elements from a specific TextpoolList so that I can export them to another file. Later I want to make use of the attributes of TextpoolList and scan the entries added to ShortText. That's why I want to use an XML parser.

I decided to give RapidXML a chance. Since this file is quite large, I need to move some data from the stack to the heap. I don't really know how to do that, so I am asking for help. I tried something similar to https://linuxhint.com/parse_xml_in_c__/.

    rapidxml::xml_document<> doc;
    rapidxml::xml_node<>* root_node = NULL;
    rapidxml::xml_node<>* block_node = NULL;
    rapidxml::xml_node<>* record_node = NULL;
    rapidxml::xml_node<>* text_node = NULL;

    std::ifstream infile(file);
    std::string line;
    std::string tp_data;

    while (std::getline(infile, line))
        tp_data += line;

    std::vector<char> tp_data_copy(tp_data.begin(), tp_data.end());

    tp_data_copy.push_back('\0');

    doc.parse<0>(&tp_data_copy[0]);

    root_node = doc.first_node("TextpoolList");

    for (rapidxml::xml_node<>* textpool_node = root_node->first_node("Textpool"); textpool_node; textpool_node = textpool_node->next_sibling())
    {
        for (rapidxml::xml_node<>* list_node = textpool_node->first_node("TextpoolList"); list_node; list_node = list_node->next_sibling())
        {
            for (rapidxml::xml_node<>* block_node = list_node->first_node("TextpoolBlock"); block_node; block_node = block_node->next_sibling())
            {
                for (rapidxml::xml_node<>* record_node = block_node->first_node("TextpoolRecord"); record_node; record_node = record_node->next_sibling())
                {
                    for (rapidxml::xml_node<>* text_node = record_node->first_node("Text"); text_node; text_node = text_node->next_sibling())
                    {
                        std::cout << "record =   " << text_node->value();
                        std::cout << std::endl;
                    }
                    std::cout << std::endl;
                }
            }
        }
    }
    }

Edit: I changed my code so that, as far as I can tell, the data ends up on the heap, but I still get the same error telling me to store the data on the heap rather than on the stack.

Thanks for all your ideas!

  • `This example doesn't work out.` - what exactly didn't work out? – SergeyA May 03 '21 at 18:33
  • `root_node->first_node("TextpoolList")` shouldn't that be "TextpoolBlock" ? – acraig5075 May 03 '21 at 18:48
  • It is better to use XSLT transformation for the task. – Yitzhak Khabinsky May 03 '21 at 18:48
  • Generally that looks OK, although reading into a `std::string` rather than `std::vector` is cleaner, and is also heap-based. However, your traversal of the doc doesn't match the XML sample. You don't traverse `TextpoolBlock`s, and you print the `->value()` of `TextpoolRecord`s, but they don't have one - just sub-elements. Did you mean to print the value of their `Text` node instead? – Roddy May 04 '21 at 13:47
  • If I read it into a std::string, something inside rapidxml fails. Yes, I wanted to print the value of their Text node, but I didn't know how to access it – doublesobig May 04 '21 at 16:19

1 Answer

Okay, things finally work. This is my routine:

    rapidxml::xml_document<> doc;
    rapidxml::xml_node<>* root_node = NULL;
    rapidxml::xml_node<>* block_node = NULL;
    rapidxml::xml_node<>* record_node = NULL;
    rapidxml::xml_node<>* text_node = NULL;
    rapidxml::xml_node<>* list_node = NULL;

    std::ifstream infile(file);
    std::string line;
    std::string tp_data;

    while (std::getline(infile, line))
        tp_data += line;

    std::vector<char> tp_data_copy(tp_data.begin(), tp_data.end());

    tp_data_copy.push_back('\0');

    doc.parse<0>(&tp_data_copy[0]);

    root_node = doc.first_node("Root_2010");

    for (rapidxml::xml_node<>* textpool_node = root_node->first_node("Textpool"); textpool_node; textpool_node = textpool_node->next_sibling())
    {
        for (rapidxml::xml_node<>* list_node = textpool_node->first_node("TextpoolList"); list_node; list_node = list_node->next_sibling())
        {
            for (rapidxml::xml_node<>* block_node = list_node->first_node("TextpoolBlock"); block_node; block_node = block_node->next_sibling())
            {
                for (rapidxml::xml_node<>* record_node = block_node->first_node("TextpoolRecord"); record_node; record_node = record_node->next_sibling())
                {
                    for (rapidxml::xml_node<>* text_node = record_node->first_node("Text"); text_node; text_node = text_node->next_sibling())
                    {
                        std::cout << "record =   " << text_node->value();
                        std::cout << std::endl;
                    }
                    std::cout << std::endl;
                }
            }
        }
    }

If there is some time left, I will try to find a cleaner way to read the file.
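One possible cleanup of the file reading, as a minimal sketch rather than a tested replacement for the routine above: read the stream directly into a `std::vector<char>`, which keeps its contents on the heap and preserves newlines. The function name `read_xml_buffer` is made up for this illustration and is not part of RapidXML.

```cpp
#include <fstream>
#include <iterator>
#include <stdexcept>
#include <string>
#include <vector>

// Hypothetical helper (not part of RapidXML): read the whole file into a
// heap-allocated, null-terminated buffer. Newlines are kept, unlike a
// std::getline loop that drops them while concatenating lines.
std::vector<char> read_xml_buffer(const std::string& path)
{
    std::ifstream in(path, std::ios::binary);
    if (!in)
        throw std::runtime_error("cannot open " + path);

    // The extra parentheses around the first argument avoid the
    // "most vexing parse" (otherwise this declares a function).
    std::vector<char> buffer((std::istreambuf_iterator<char>(in)),
                             std::istreambuf_iterator<char>());
    buffer.push_back('\0');  // in-place parsers expect a terminated string
    return buffer;
}
```

The result could then be handed to the parser as `doc.parse<0>(buffer.data())`. Because RapidXML parses in place and its nodes point into the source text, the buffer has to stay alive for as long as `doc` is used.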

  • Consider flipping your conditionals to reduce the large amounts of arrow code. – Casey May 04 '21 at 18:49
  • A few comments: #1 You don't need the `rapidxml::xml_node<>*` variables at the top, because you (correctly) declare them in the loop scopes. Apart from `root_node`, of course, which you should declare and initialise in one line. `rapidxml::xml_node<>* root_node = doc.first_node(...` – Roddy May 05 '21 at 06:58
  • #2: your `next_sibling()` calls don't specify a node name, so you'll fetch the next sibling regardless of name. – Roddy May 05 '21 at 07:01
  • #3: Your XML 'sample' is seriously malformed. https://www.xmlvalidation.com/index.php?id=1&L=0 – Roddy May 05 '21 at 07:01
  • #4: Your file reading is super-weird, and will remove all newline sequences so multiline `Text` nodes won't work. Use `rapidxml::file`. https://stackoverflow.com/questions/2808022/how-to-parse-an-xml-file-with-rapidxml/14524464 – Roddy May 05 '21 at 07:08
  • #5: Does your data format actually need multiple Text nodes under each `Textpool`? If not, you can remove the loop. – Roddy May 05 '21 at 07:10
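
The points about unnamed `next_sibling()` calls and deep nesting can be illustrated without RapidXML itself. The sketch below uses a hypothetical `Node` struct as a stand-in for an XML element (it is not the RapidXML node type) and only visits children whose name matches, with a guard clause keeping the loop body flat. With RapidXML, the comparable change would presumably be to pass the element name to the sibling lookup, for example `text_node->next_sibling("Text")`.

```cpp
#include <string>
#include <vector>

// Hypothetical stand-in for an XML element; NOT the RapidXML node type.
struct Node {
    std::string name;
    std::string value;
    std::vector<Node> children;
};

// Collect the values of the direct children called `wanted`. Siblings with
// any other name are skipped, which is what an unnamed sibling lookup
// would fail to do.
std::vector<std::string> child_values(const Node& parent,
                                      const std::string& wanted)
{
    std::vector<std::string> out;
    for (const Node& child : parent.children) {
        if (child.name != wanted)
            continue;  // guard clause instead of another nesting level
        out.push_back(child.value);
    }
    return out;
}
```

Called on a record that contains both `Text` and `Creator` children, `child_values(record, "Text")` returns only the `Text` values, whereas iterating over every sibling would also visit `Creator`.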