I'm parsing an XML file that is too big to load into memory completely, so I am using xml.etree.ElementTree.iterparse to parse it.

The problem I'm having is that sometimes, when I retrieve an element from the iterator, I find that some information which is present in my XML file has been omitted by ElementTree. Is this expected behaviour?

An example

...
<car>
    <engine>
        <part name="pump"/>
        <part name="ECU"/>
    </engine>
</car>
...

Suppose I'm parsing the XML snippet above with an xml.etree.ElementTree.iterparse iterator. At some point, the iterator gives me an element elem, which corresponds to the XML car element.

Then, I perform xml.etree.ElementTree.dump(elem) to see how well elem captures the actual XML data, and I get:

<car>
    <engine>
        <part name="pump"/>
        <part/>
    </engine>
</car>

Now, notice how the name of the second part element was not captured. Why does this happen and how can I work around it?

Severo Raz
  • Please provide a proper [mcve] (complete code that we can just copy, paste and run). – mzjn Jan 28 '21 at 11:22

1 Answer

After some deeper searching, I found that people have reported this issue with other XML parsing libraries as well, when using a parsing iterator on large documents.

It turns out that when you process an element on its "start" event, the element may not be fully loaded yet. The solution is to process elements on the "end" event instead.

From the question by Andreas titled "lxml.etree iterparse() and parsing element completely", I borrow the following quote, which I tracked down as coming from a tutorial on lxml:

"Note that the text, tail, and children of an Element are not necessarily present yet when receiving the start event. Only the end event guarantees that the Element has been parsed completely."
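As a minimal sketch of that approach, assuming the snippet from the question is wrapped in a hypothetical cars root element, the code below listens only for "end" events and reads the part names once the whole car element has been parsed. The names XML and part_names are made up for illustration; iterparse, Element.iter, Element.get and Element.clear are the standard xml.etree.ElementTree calls.

```python
import io
import xml.etree.ElementTree as ET

# Hypothetical sample document: the <car> snippet from the question,
# wrapped in an assumed <cars> root element.
XML = b"""<cars>
<car>
    <engine>
        <part name="pump"/>
        <part name="ECU"/>
    </engine>
</car>
</cars>"""


def part_names(source):
    """Collect the name attribute of every <part> inside each <car>."""
    names = []
    # Only "end" events: an element is complete when its end tag is seen.
    for event, elem in ET.iterparse(source, events=("end",)):
        if elem.tag == "car":
            names.extend(part.get("name") for part in elem.iter("part"))
            # Drop the children we have already processed to limit memory use.
            elem.clear()
    return names


names = part_names(io.BytesIO(XML))
print(names)  # ['pump', 'ECU']
```

Calling elem.clear() after handling each car is optional, but it is a common way to keep memory usage down when the file is too large to hold in memory.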

Severo Raz