Parsing XML with invalid nodes

Question

I am parsing a very large XML file. When a node fails to parse, I want to keep looping and process the remaining nodes.

version 1

for event, element in etree.iterparse(file):
    if element.tag == "tag1":
        pass  # Doing some stuff

With the first version I get an exception:

ParseError: not well-formed (invalid token): line 319851

So, in order to process the remaining nodes, I wrote a second version:

version 2

xml_parser = etree.iterparse(file)

while True:
    try:
        event, element = next(xml_parser)

        if element.tag == "tag1":
            pass  # Doing some stuff
    # If there are no more elements to iterate, break the loop
    except StopIteration:
        break
    # On any other exception, keep looping
    except Exception as e:
        pass

In that case the script enters an infinite loop.

So I tried to inspect the specific lines by opening the file as text:

with open(file) as fp:
    for i, line in enumerate(fp):
        if i == 319850:
            print(319850, line)
        if i == 319851:
            print(319851, line)
        if i == 319852:
            print(319852, line)
        if i == 319853:
            print(319853, line)

            break

I get:

319850    <tag1> <tag11><![CDATA[ foo bar

319851    ]]></tag11></tag1>

319852    <tag1> <tag11><![CDATA[ foo bar]]></tag11></tag1>

319853    <tag1> <tag11><![CDATA[ foo bar]]></tag11></tag1>

So it seems that the line is split by "\n". That is an XML error, but why does my second version not work? In my second version, lines 319850 and 319851 are not valid XML, so they should be skipped and the loop should move on to the next nodes/lines.

What am I doing wrong here? If you have a better approach, please let me know.

UPDATE

The XML file contains an invalid character, '\x0b'. So it looks like this:

<tag1> <tag11><![CDATA[ foo bar '\x0b']]></tag11></tag1>
<tag1> <tag11><![CDATA[ foo bar]]></tag11></tag1>
<tag1> <tag11><![CDATA[ foo bar]]></tag11></tag1>
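
One possible way to get past a stray control character such as '\x0b' is to strip it before the data reaches the parser. The following is only a sketch under assumptions: it uses the standard library's xml.etree.ElementTree (which may not be the etree imported above), it assumes the file is in an ASCII-compatible encoding such as UTF-8, and the names iter_tag and _INVALID_XML_BYTES are hypothetical helpers made up for illustration.

```python
import io
import re
import xml.etree.ElementTree as ET

# C0 control bytes that XML 1.0 does not allow (everything below 0x20
# except tab, line feed and carriage return). Assumes UTF-8 or another
# ASCII-compatible encoding, where these bytes never occur inside a
# multi-byte character.
_INVALID_XML_BYTES = re.compile(rb"[\x00-\x08\x0b\x0c\x0e-\x1f]")


def iter_tag(fp, tag):
    """Yield completed elements named `tag` from a binary file object,
    removing forbidden control bytes from each line before parsing."""
    parser = ET.XMLPullParser(events=("end",))

    def drain():
        # Hand out every element the parser has finished so far.
        for _event, elem in parser.read_events():
            if elem.tag == tag:
                yield elem

    for line in fp:
        parser.feed(_INVALID_XML_BYTES.sub(b"", line))
        yield from drain()
    parser.close()
    yield from drain()  # events that only became available on close()


# Small in-memory stand-in for the real file (opened with mode "rb").
sample = (
    b"<whole>\n"
    b"<tag1> <tag11><![CDATA[ foo bar \x0b]]></tag11></tag1>\n"
    b"<tag1> <tag11><![CDATA[ foo bar]]></tag11></tag1>\n"
    b"</whole>\n"
)
texts = [elem.find("tag11").text for elem in iter_tag(io.BytesIO(sample), "tag1")]
print(texts)
```

Filtering line by line keeps memory use flat for a large file, at the cost of silently dropping the offending bytes; replacing them with a space instead would be an equally reasonable choice.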

A newline in a CDATA section is not an XML error. How can we reproduce this? — mzjn, Apr 13 '17 at 10:51
See http://stackoverflow.com/questions/20078816/replace-non-ascii-characters-with-a-single-space for instance. — Bill Bell, Apr 13 '17 at 22:48

Bill Bell · Answer 1 · 2017-04-14T16:57:37.577

I have taken those lines that seem to be causing trouble and stuffed them into a slightly bigger xml file for trial purposes. This is it.

<whole>
<tag1>
<tag11>one</tag11>
<tag11><![CDATA[ foo bar
]]></tag11>
<tag11>two</tag11>
<tag11>three</tag11>
</tag1>
<tag1> <tag11><![CDATA[ foo bar
]]></tag11></tag1>
<tag1> <tag11><![CDATA[ foo bar]]></tag11></tag1>
<tag1> <tag11><![CDATA[ foo bar]]></tag11></tag1>
<tag1>
<tag11>three</tag11>
<tag11>four</tag11>
<tag11>five</tag11>
<tag11>six</tag11>
</tag1>
</whole>

Then I ran the following code that displayed its results at the end.

>>> import os
>>> os.chdir('c:/scratch')
>>> from lxml import etree
>>> context = etree.iterparse('temp.xml')
>>> for action, elem in context:
...     print (action, elem.tag, elem.sourceline)
...     
end tag11 3
end tag11 4
end tag11 6
end tag11 7
end tag1 2
end tag11 9
end tag1 9
end tag11 11
end tag1 11
end tag11 12
end tag1 12
end tag11 14
end tag11 15
end tag11 16
end tag11 17
end tag1 13
end whole 1

In short, there seems to be nothing wrong with those lines.

You could try printing the line numbers in which tags were found, in order to find the vicinity of the place giving trouble in the xml. (This is an edit based on knowledge that I have newly acquired on SO.)
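
As a rough sketch of that idea, the parser's own exception can also report where it stopped. This example swaps in the standard library's xml.etree.ElementTree rather than lxml, so it relies on ParseError.position (a line and column pair) instead of elem.sourceline; the function name find_parse_error and the sample data are made up for illustration.

```python
import io
import xml.etree.ElementTree as ET


def find_parse_error(source):
    """Parse `source` and return the (line, column) position of the first
    ParseError, or None if the document is well formed."""
    try:
        for _event, _elem in ET.iterparse(source):
            pass  # elements could be inspected or printed here
    except ET.ParseError as err:
        return err.position
    return None


# In-memory stand-in for a file with a forbidden '\x0b' on its third line.
broken = io.BytesIO(
    b"<whole>\n"
    b"<tag1>ok</tag1>\n"
    b"<tag1>bad \x0b</tag1>\n"
    b"</whole>\n"
)
position = find_parse_error(broken)
print(position)
```

The line number in the position counts from 1, whereas enumerate() over a file counts from 0 by default, which is worth keeping in mind when comparing the two.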

I would also suggest using the looping structure suggested in the documentation to avoid the infinite loop. That's what I did in this code.
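
A minimal sketch of that looping structure, again assuming the standard library's xml.etree.ElementTree instead of lxml, might look like the following. The for loop ends by itself when the iterator is exhausted, and a single try/except around the whole loop means a parse error stops the loop once rather than being retried forever. The name count_tag is hypothetical, and counting stands in for "doing some stuff".

```python
import io
import xml.etree.ElementTree as ET


def count_tag(source, tag):
    """Count elements named `tag`; return (count, error), where error is
    the ParseError that ended parsing early, or None."""
    count = 0
    error = None
    try:
        for _event, elem in ET.iterparse(source):
            if elem.tag == tag:
                count += 1    # placeholder for the real per-node work
                elem.clear()  # drop the element's content to limit memory use
    except ET.ParseError as err:
        error = err           # parsing cannot continue past this point
    return count, error


# In-memory stand-in for the file; note the newline inside CDATA is valid.
good = io.BytesIO(
    b"<whole>\n"
    b"<tag1><tag11>one</tag11></tag1>\n"
    b"<tag1> <tag11><![CDATA[ foo bar\n]]></tag11></tag1>\n"
    b"</whole>\n"
)
count, error = count_tag(good, "tag1")
print(count, error)
```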

FYI: I know you've solved your problem but you might be interested in the edit. — Bill Bell, Apr 14 '17 at 16:58
