I am processing large XML files with lxml.iterparse. This worked well, but as my files have grown a lot lately, I ran into iterparse behaviour that fills up my memory. Consider the following code, which writes a file with 300000 elem elements followed by 300000 other_elem elements:

els = ('<elem><subel1>{0}</subel1><subel2>{0}</subel2><subel3>{0}</subel3><subel4>{0}</subel4><subel5>{0}</subel5><subel6>{0}</subel6></elem>'.format(x) for x in range(300000))
other_els = ('<other_elem><subel1>{0}</subel1><subel2>{0}</subel2><subel3>{0}</subel3><subel4>{0}</subel4><subel5>{0}</subel5><subel6>{0}</subel6></other_elem>'.format(x) for x in range(300000))

with open('/tmp/test.xml', 'w') as fp:
   fp.write('<root>\n')
   fp.write('<elements>\n')
   for el in els:
       fp.write(el+'\n')
   fp.write('</elements>\n')
   fp.write('<other_elements>\n')
   for el in other_els:
       fp.write(el+'\n')
   fp.write('</other_elements>\n')
   fp.write('</root>\n')

I then use the following to parse only the elem elements (and do nothing with them), while printing memory usage from time to time:

import os

import psutil
from lxml import etree

process = psutil.Process(os.getpid())
gen = etree.iterparse('/tmp/test.xml', tag='elem')
elscount = 0
for ac, el in gen:
    elscount += 1
    # free the element's content, then drop the oldest sibling still in the tree
    el.clear()
    if el.getprevious() is not None:
        del el.getparent()[0]
    if elscount % 10000 == 0:
        # resident set size in MiB
        print(process.memory_info().rss // (1024 * 1024))

print(process.memory_info().rss // (1024 * 1024))

The output shows low memory usage until the end, when it suddenly jumps. This behaviour disappears when I read a file that does not contain the other_elem elements. A slower workaround, which leaves out the tag argument to iterparse and instead tests each element's tag with an if statement, keeps memory usage low, possibly because it can call el.clear() on the elements that don't match. My question is therefore not how to solve this, but why iterparse uses memory for elements it does not have to output, or, possibly, what I am doing wrong here.
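
For reference, the slower workaround looks roughly like the sketch below. The function name count_matching and its wanted_tag parameter are made up for this illustration, and the in-memory sample at the bottom is just a tiny stand-in for /tmp/test.xml. It only shows the shape of the loop; I have not measured whether it is the best way to keep memory down.

```python
import io

from lxml import etree


def count_matching(source, wanted_tag="elem"):
    """Count <wanted_tag> elements, freeing every element as soon as it ends.

    `source` may be a filename or a binary file-like object.
    (Hypothetical helper, named for this illustration only.)
    """
    count = 0
    # No tag= filter here: the loop sees the "end" event of every element,
    # so non-matching elements can be cleared as well.
    for _event, el in etree.iterparse(source, events=("end",)):
        if el.tag == wanted_tag:
            count += 1
        # Drop the element's children and text ...
        el.clear()
        # ... and remove already-processed siblings that precede it.
        while el.getprevious() is not None:
            del el.getparent()[0]
    return count


if __name__ == "__main__":
    sample = (
        b"<root><elements>"
        + b"<elem><subel1>1</subel1></elem>" * 3
        + b"</elements><other_elements>"
        + b"<other_elem><subel1>1</subel1></other_elem>" * 2
        + b"</other_elements></root>"
    )
    print(count_matching(io.BytesIO(sample)))
```

The only difference from the tag='elem' version is that the tag test happens in Python, which is why it is slower, but every element passes through the loop and can be cleared there.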

glormph