I appreciate your help on the following: I need to read a large XML file and convert it to CSV.

I have two functions that are supposed to do the same thing. One (function1) uses iterparse, because I need to process files of about 2 GB, and the other (function2) does not.

function2 works really well for the same kind of XML file up to about 150 MB; beyond that size it fails because it runs out of memory.

The problem is that, although function1 runs without errors, it loses some of the children (this is a huge problem!). function2, on the other hand, reads all the children and doesn't lose or skip any.

Q: Can you see in the code of function1 why some children would be lost (or not read correctly, or ignored)?

Note1: I have a 50 KB XML sample ready to send in case needed.
Note2: the variable 'nchil_count' is just to count the number of children.

CODE (function1):

def function1 ():
    # This function uses Iterparse
    # Doesn't give errors but loses some children. Why?
    # prints output to csv file, WCEL.csv

    from xml.etree.cElementTree import iterparse

    fname = "C:\Leonardo\Input data\Xml input data\NetactFiles\Netact_3g_rnc11_t1.xml"
    # ELEMENT_LIST = ["WCEL"]

    # Delete contents from exit file
    open("C:\Leonardo\Input data\Xml input data\WCEL.csv", 'w').close()

    # Open exit file
    with open("C:\Leonardo\Input data\Xml input data\WCEL.csv", "a") as exit_file:

        with open(fname) as xml_doc:
            context = iterparse(xml_doc, events=("start", "end"))
            context = iter(context)
            event, root = context.next()

            for event, elem in context:

                if event == "start" and elem.tag == "{raml20.xsd}managedObject":
                # if event == "start":
                    if elem.get('class') == 'WCEL':
                        print elem.attrib
                        # print elem.tag

                        element = elem.getchildren()
                        nchil_count = 0

                        for child in element:
                            if child.tag == "{raml20.xsd}p":
                                nchil_count = nchil_count + 1
                                # print child.tag
                                # print child.attrib
                                val = child.text
                                # print val
                                val = str (val)
                                exit_file.write(val + ",")

                        exit_file.write('\n')
                        print nchil_count

                elif event == "end" and elem.tag == "{raml20.xsd}managedObject":
                    # Clear Memory
                    root.clear()

    xml_doc.close()
    exit_file.close()

    return ()

CODE (function2):

def function2 (xmlFile):
    # Using Element Tree
    # Successful
    # Works well with files of 150 MB, like an XML (RAML) RNC export from Netact (1 RNC only)
    # It fails with huge files due to Memory

    import xml.etree.cElementTree as etree
    import shutil

    with open("C:\Leonardo\Input data\Xml input data\WCEL.csv", "a") as exit_file:

        # Populate the values per cell:

        tree = etree.parse(xmlFile)
        for value in tree.getiterator(tag='{raml20.xsd}managedObject'):
            if value.get('class') == 'WCEL':
                print value.attrib

                element = value.getchildren()
                nchil_count = 0

                for child in element:
                    if child.tag == "{raml20.xsd}p":
                        nchil_count = nchil_count + 1
                        # print child.tag
                        # print child.attrib
                        val = child.text
                        # print val

                        val = str (val)
                        exit_file.write(val + ",")

                exit_file.write('\n')
                print nchil_count

    exit_file.close() ## File closing after writing.

    return ()
Andrew

1 Answer

I had a similar problem. There were some important differences, though:

  • I used lxml.etree, not xml.etree (the Windows binary 'lxml-3.4.2-cp34-none-win32.whl' from http://www.lfd.uci.edu/~gohlke/pythonlibs/#lxml)
  • I used iterparse for a specific element, with the end event active
  • I then drilled down into that element using the xpath() method

But the result was equivalent: some of the nodes were ignored (lost). Nothing in the file could explain why. For a given file, it was always the same nodes. But after a purely technical change to the file (reformatting it with xmllint), different nodes were lost.

I reorganized the code (no xpath(), iterparse without the tag argument, both 'start' and 'end' events, controlling the process through the element.tag value) and found that sometimes (I don't know when) the process "forgets" the default namespace. In most cases the value of element.tag was "{namespace uri}tag_name", but in about 2% of cases it was just "tag_name". That's why those elements weren't found by xpath().
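A quick way to check whether a given file shows this effect is to count how many tags arrive with and without the "{namespace uri}" part. The following is only a sketch: it assumes Python 3 and the standard library xml.etree.ElementTree rather than lxml, and namespace_census and SAMPLE are made-up names used for illustration.

```python
import io
from collections import Counter
import xml.etree.ElementTree as ET


def namespace_census(source):
    """Count element tags seen with and without a '{uri}' prefix."""
    counts = Counter()
    for _event, elem in ET.iterparse(source, events=("end",)):
        key = "qualified" if elem.tag.startswith("{") else "bare"
        counts[key] += 1
        elem.clear()  # drop the element's content once it has been counted
    return counts


# Tiny stand-in document (hypothetical); a real run would pass a file path.
SAMPLE = (b'<raml xmlns="raml20.xsd">'
          b'<managedObject class="WCEL"><p>1</p></managedObject>'
          b'</raml>')

census = namespace_census(io.BytesIO(SAMPLE))
print(census)
```

A non-zero "bare" count on a file where every element should be in the default namespace would point to the behaviour described above.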

I knew that everything in the file was in one default namespace, so I could add "{namespace uri}" myself, and the file was then processed correctly.

There was no problem when a namespace prefix was declared explicitly in the main tag and used in all other tags.

This looks like a bug somewhere in the parsing of large XML files, and probably not in lxml itself if you saw the same effect with xml.etree.
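The workaround of adding the namespace by hand can be sketched roughly like this. It assumes the namespace URI is "raml20.xsd", as in the tags shown in the question, and that every element in the file belongs to it; qualified is a hypothetical helper name, not part of lxml or xml.etree.

```python
NS = "raml20.xsd"  # assumption: the file's single default namespace


def qualified(tag, ns=NS):
    """Return the tag in '{ns}name' form, adding the namespace if missing."""
    if tag.startswith("{"):
        return tag  # already carries a namespace, leave it untouched
    return "{%s}%s" % (ns, tag)


# Both spellings compare equal after normalisation.
print(qualified("p"))
print(qualified("{raml20.xsd}p"))
```

Comparing qualified(element.tag) instead of element.tag keeps the rest of the loop unchanged, though it is only safe when no element is meant to be outside the default namespace.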
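Separately from the namespace issue, it may be worth checking at which event function1 reads the children. The standard library documentation for iterparse notes that at a "start" event the element's children may or may not be present yet, so reading them there could be one reason for missing values. Below is a minimal sketch that reads them at the "end" event instead, assuming Python 3 and xml.etree.ElementTree; wcel_rows and SAMPLE are hypothetical names, and the sample document is invented for illustration.

```python
import io
import xml.etree.ElementTree as ET

NS = "{raml20.xsd}"  # namespace prefix as it appears in the question's tags


def wcel_rows(source):
    """Yield one list of <p> text values per managedObject of class WCEL."""
    for _event, elem in ET.iterparse(source, events=("end",)):
        if elem.tag != NS + "managedObject":
            continue
        if elem.get("class") == "WCEL":
            # At the "end" event all children of elem have been parsed.
            yield [p.text or "" for p in elem if p.tag == NS + "p"]
        elem.clear()  # free the children of the finished managedObject


# Invented sample; a real run would pass the path of the XML export.
SAMPLE = b"""<raml xmlns="raml20.xsd"><cmData>
<managedObject class="WCEL"><p name="a">1</p><p name="b">2</p></managedObject>
<managedObject class="RNC"><p name="c">3</p></managedObject>
<managedObject class="WCEL"><p name="d">4</p></managedObject>
</cmData></raml>"""

rows = list(wcel_rows(io.BytesIO(SAMPLE)))
print(rows)
```

The rows could then be written with csv.writer(...).writerows(...) from the standard library rather than by joining values with commas by hand.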