Questions tagged [iterparse]

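iterparse is an XML parsing function (for example in Python's ElementTree and lxml) that builds the element tree incrementally and reports events while the tree is being built.

Use this tag for questions about XML parsing code that uses iterparse. iterparse normally builds a tree as it parses the XML, and you can safely rearrange or remove parts of that tree while parsing is still in progress.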
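
As a rough illustration of removing parts of the tree while parsing, here is a minimal sketch using Python's standard-library xml.etree.ElementTree.iterparse. The sample document, the "item" tag, and the helper name iter_item_texts are made up for this example; a real input would typically be a large file rather than an in-memory buffer.

```python
import io
import xml.etree.ElementTree as ET

# Hypothetical sample document; real inputs are usually large files on disk.
xml_data = b"<root><item id='1'>a</item><item id='2'>b</item><item id='3'>c</item></root>"


def iter_item_texts(source):
    """Yield (id, text) for each <item>, clearing each element once it is processed."""
    # "end" events fire when an element has been fully parsed, children included.
    for event, elem in ET.iterparse(source, events=("end",)):
        if elem.tag == "item":
            yield elem.get("id"), elem.text
            # Drop the element's text, attributes and children to release memory.
            elem.clear()


results = list(iter_item_texts(io.BytesIO(xml_data)))
print(results)
```

Note that clear() empties each element but leaves the empty element attached to its parent, so for very large documents the leftover references may also need to be removed from the parent.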

72 questions
27
votes
3 answers

Why is lxml.etree.iterparse() eating up all my memory?

This eventually consumes all my available memory and then the process is killed. I've tried changing the tag from schedule to 'smaller' tags but that didn't make a difference. What am I doing wrong / how can I process this large file with…
sente
24
votes
2 answers

ElementTree iterparse strategy

I have to handle XML documents that are fairly big (up to 1GB) and parse them with Python. I am using the iterparse() function (SAX style parsing). My concern is the following: imagine you have an XML like this
16
votes
3 answers

using lxml and iterparse() to parse a big (+- 1Gb) XML file

I have to parse a 1Gb XML file with a structure such as below and extract the text within the tags "Author" and "Content": MM/DD/YY Last Name, Name Lorem ipsum…
mvime
13
votes
1 answer

lxml etree.iterparse error "TypeError: reading file objects must return plain strings"

I would like to parse an HTML document using lxml. I am using python 3.2.3 and lxml 2.3.4 ( http://www.lfd.uci.edu/~gohlke/pythonlibs/#lxml ) I am using the etree.iterparse to parse the document, but it returns the following run-time…
Ababneh A
11
votes
4 answers

Parsing huge, badly encoded XML files in Python

I have been working on code that parses external XML-files. Some of these files are huge, up to gigabytes of data. Needless to say, these files need to be parsed as a stream because loading them into memory is much too inefficient and often leads to…
Rik
8
votes
2 answers

How to parse this huge XML file with nested elements using lxml the efficient way?

I tried parsing this huge XML document using XML minidom. While it worked fine on a sample file, it choked the system when trying to process the real file (about 400 MB). I tried adapting code (it processes data in a streaming fashion rather than…
ThinkCode
8
votes
2 answers

lxml iterparse in python can't handle namespaces

from lxml import etree import StringIO data= StringIO.StringIO('OneTwoThree') docs = etree.iterparse(data,tag='a') a,b = docs.next() Traceback (most recent call last): File…
James Townley
6
votes
4 answers

Iteratively parsing HTML (with lxml?)

I'm currently trying to iteratively parse a very large HTML document (I know.. yuck) to reduce the amount of memory used. The problem I'm having is that I'm getting XML syntax errors such as: lxml.etree.XMLSyntaxError: Attribute name redefined, line…
Acorn
5
votes
2 answers

lxml.etree.iterparse closes input file handler?

filterous is using iterparse to parse a simple XML StringIO object in a unit test. However, when trying to access the StringIO object afterwards, Python exits with a "ValueError: I/O operation on closed file" message. According to the iterparse…
l0b0
5
votes
1 answer

Iterparse object has no attribute next

I am parsing a 700 MB file. I have the following code, which works fine on my test file without the lines context.iter(context) and event, elem = context.next(). from xml.etree import cElementTree as ET source = ("AAT.xml") context =…
ADWALSH
5
votes
0 answers

lxml iterparse tag argument and memory consumption

I am processing large xml files with lxml.iterparse. This works well, but as my files got a lot bigger lately, I found iterparse behaviour that filled my memory. Consider the following code, that writes a file with 300000 elements of one sort and…
glormph
4
votes
0 answers

Iterparse truncating XML elements

I have a large XML file (about 600 MB) that I am trying to parse using cElementTree with iterparse. First time attempting this. I am iterating on 'product' tags and elem.clear()-ing after I process each product. Within my parsing I have a function…
alsoALion
4
votes
1 answer

Retrieving XML attribute values using Python iterparse

I'm trying to find out how to retrieve XML attribute values using the cElementTree iterparse in Python (2.7). My XML is something like this:
RTF
4
votes
1 answer

Use iterparse and, subsequently, xpath on documents with inconsistent namespace declarations

I need to put together a piece of code that parses a possibly large XML file into custom Python objects. The idea is roughly the following: from lxml import etree for e, tag in etree.iterparse(source, tag='Foo'): print tag.xpath('bar/baz')[42] #…
Lev Levitsky
3
votes
5 answers

Ignore encoding errors in Python (iterparse)?

I've been fighting with this for an hour now. I'm parsing an XML-string with iterparse. However, the data is not encoded properly, and I am not the provider of it, so I can't fix the encoding. Here's the error I get: lxml.etree.XMLSyntaxError: line…
Martti Laine