I appreciate your help on the following: I need to read a large XML file and convert it to CSV.
I have two functions that are supposed to do the same thing: one (function1) uses iterparse, because I need to process files of about 2 GB, and the other (function2) does not.
Function2 works well on the same kind of XML file up to about 150 MB; beyond that size it fails because it runs out of memory.
The problem is that, although function1 runs without errors, it loses some of the children (this is a huge problem!). Function2, on the other hand, reads all the children and doesn't lose or skip any.
Q: Can you see anything in the code of function1 that would explain why some children are lost (or not read correctly, or ignored)?
Note1: I have a 50 KB XML sample ready to send in case it is needed.
Note2: the variable 'nchil_count' is just to count the number of children.
CODE (function1):
def function1 ():
    # This function uses Iterparse
    # Doesn't give errors but loses some children. Why?
    # prints output to csv file, WCEL.csv
    from xml.etree.cElementTree import iterparse
    fname = "C:\Leonardo\Input data\Xml input data\NetactFiles\Netact_3g_rnc11_t1.xml"
    # ELEMENT_LIST = ["WCEL"]
    # Delete contents from exit file
    open("C:\Leonardo\Input data\Xml input data\WCEL.csv", 'w').close()
    # Open exit file
    with open("C:\Leonardo\Input data\Xml input data\WCEL.csv", "a") as exit_file:
        with open(fname) as xml_doc:
            context = iterparse(xml_doc, events=("start", "end"))
            context = iter(context)
            event, root = context.next()
            for event, elem in context:
                if event == "start" and elem.tag == "{raml20.xsd}managedObject":
                    # if event == "start":
                    if elem.get('class') == 'WCEL':
                        print elem.attrib
                        # print elem.tag
                        element = elem.getchildren()
                        nchil_count = 0
                        for child in element:
                            if child.tag == "{raml20.xsd}p":
                                nchil_count = nchil_count + 1
                                # print child.tag
                                # print child.attrib
                                val = child.text
                                # print val
                                val = str (val)
                                exit_file.write(val + ",")
                        exit_file.write('\n')
                        print nchil_count
                elif event == "end" and elem.tag == "{raml20.xsd}managedObject":
                    # Clear Memory
                    root.clear()
        xml_doc.close()
    exit_file.close()
    return ()
CODE (function2):
def function2 (xmlFile):
    # Using Element Tree
    # Successful
    # Works well with files of 150 MB, like an XML (RAML) RNC export from Netact (1 RNC only)
    # It fails with huge files due to Memory
    import xml.etree.cElementTree as etree
    import shutil
    with open("C:\Leonardo\Input data\Xml input data\WCEL.csv", "a") as exit_file:
        # Populate the values per cell:
        tree = etree.parse(xmlFile)
        for value in tree.getiterator(tag='{raml20.xsd}managedObject'):
            if value.get('class') == 'WCEL':
                print value.attrib
                element = value.getchildren()
                nchil_count = 0
                for child in element:
                    if child.tag == "{raml20.xsd}p":
                        nchil_count = nchil_count + 1
                        # print child.tag
                        # print child.attrib
                        val = child.text
                        # print val
                        val = str (val)
                        exit_file.write(val + ",")
                exit_file.write('\n')
                print nchil_count
    exit_file.close() ## File closing after writing.
    return ()
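
DIAGNOSTIC SKETCH (not part of either function):
One way to narrow this down might be to compare how many "p" children a managedObject has when its "start" event is delivered with how many it has at its "end" event. The sketch below is only an illustration under several assumptions: it is written for Python 3 rather than the Python 2 used above, it uses xml.etree.ElementTree.XMLPullParser instead of iterparse so that the input can be fed in deliberately chosen pieces, and the two-chunk XML string and the helper name count_p_children are made up for the test. It is not taken from the real Netact file.

```python
import xml.etree.ElementTree as ET

NS = "{raml20.xsd}"  # same namespace prefix as in the functions above


def count_p_children(chunks):
    """Feed XML text chunks to a pull parser and record, for every
    managedObject, how many <p> children are attached to it at the
    moment its "start" and "end" events are read."""
    parser = ET.XMLPullParser(events=("start", "end"))
    counts = {"start": [], "end": []}
    for chunk in chunks:
        parser.feed(chunk)
        for event, elem in parser.read_events():
            if elem.tag == NS + "managedObject":
                counts[event].append(len(elem.findall(NS + "p")))
    parser.close()
    return counts


# Hypothetical input: one WCEL object whose children straddle two chunks,
# imitating a read boundary that falls in the middle of an element.
chunks = [
    '<raml xmlns="raml20.xsd"><managedObject class="WCEL"><p>1</p>',
    '<p>2</p><p>3</p></managedObject></raml>',
]

result = count_p_children(chunks)
print(result)
```

If the "start" count comes out smaller than the "end" count for the same element, as it should with this contrived split, that would suggest the children are not lost but simply not parsed yet when the "start" branch in function1 reads them.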