0

I have the following input XML file. I read the rel_notes tag and print it, but I am running into the following error.

Input XML:

<rel_notes>
    •   Please move to this build for all further test and development activities 
    •   Please use this as base build to verify compilation and sanity before any check-in happens

</rel_notes>

Sample python code:

from xml.etree import cElementTree as etree

file = open('data.xml', 'r')
tree = etree.parse(file)
print('\n'.join(elem.text for elem in tree.iter('rel_notes')))

OUTPUT

   print('\n'.join(elem.text for elem in tree.iter('rel_notes')))
 File "C:\python2.7.3\lib\encodings\cp437.py", line 12, in encode
return codecs.charmap_encode(input,errors,encoding_map)
UnicodeEncodeError: 'charmap' codec can't encode character u'\u2022' in position 9: character maps to <undefined>
user1795998

2 Answers

1

The issue is with printing Unicode to the Windows console: the character '•' (U+2022) can't be represented in cp437, the code page used by your console.

To reproduce the problem, try:

print u'\u2022'
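
The same failure can be reproduced on any platform, without a Windows console, by encoding to cp437 explicitly. This is only a minimal sketch; the variable names are illustrative, and it is written so that it also runs on Python 3:

```python
# Reproduce the console failure anywhere by encoding to cp437 by hand.
bullet = u"\u2022"  # BULLET, the character named in the traceback

try:
    bullet.encode("cp437")
    encodable = True
except UnicodeEncodeError:
    # cp437 has no mapping for U+2022, so strict encoding fails.
    encodable = False

print(encodable)
```

This is what `print` does implicitly when the console encoding is cp437.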

You could set the PYTHONIOENCODING environment variable to instruct Python to replace all unrepresentable characters with the corresponding XML character references:

T:\> set PYTHONIOENCODING=cp437:xmlcharrefreplace
T:\> python your_script.py

Or encode the text to bytes before printing:

print u'\u2022'.encode('cp437', 'xmlcharrefreplace')
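
As a rough sketch of the encode-before-printing approach, the helper below wraps the call in a function. `encode_for_console` is a hypothetical name, and the default of cp437 is an assumption taken from the traceback in the question:

```python
def encode_for_console(text, encoding="cp437"):
    """Hypothetical helper: encode text, replacing characters the
    target encoding lacks with XML character references."""
    return text.encode(encoding, "xmlcharrefreplace")


line = u"\u2022 Please move to this build"
safe = encode_for_console(line)

# The result contains only characters cp437 can represent.
print(safe.decode("cp437"))
```

The bullet comes out as `&#8226;`, the decimal character reference for U+2022, and the rest of the line is unchanged.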

Answer to your initial question

To print text of each <build_location/> element:

import sys
from xml.etree import cElementTree as etree

input_file = sys.stdin # filename or file object
tree = etree.parse(input_file)
print('\n'.join(elem.text for elem in tree.iter('build_location')))
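
The two ideas can be combined for the `rel_notes` case from the question. The following is a sketch under a few assumptions: it uses `xml.etree.ElementTree` (rather than `cElementTree`) so it also runs on recent Python 3 versions, it parses an in-memory sample instead of a real file, and it substitutes an empty string for elements without text, since `elem.text` is `None` for those and `join` would fail:

```python
import io
import xml.etree.ElementTree as etree

# In-memory stand-in for the input file (illustrative data only).
xml_bytes = (
    u"<build><rel_notes>\u2022 Use this build</rel_notes>"
    u"<rel_notes/></build>"
).encode("utf-8")

tree = etree.parse(io.BytesIO(xml_bytes))

# elem.text is None for empty elements, so fall back to an empty string.
notes = [elem.text or u"" for elem in tree.iter("rel_notes")]

# Encode explicitly so a cp437 console can display the result.
safe_output = u"\n".join(notes).encode("cp437", "xmlcharrefreplace")
print(safe_output.decode("cp437"))
```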

If the input file is large, iterparse() could be used:

import sys
from xml.etree import cElementTree as etree

input_file = sys.stdin
context = iter(etree.iterparse(input_file, events=('start', 'end')))
_, root = next(context) # get root element
for event, elem in context:
    if event == 'end' and elem.tag == 'build_location':
        print(elem.text)
        root.clear() # free memory
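
To try the incremental approach without a file or stdin, the same loop can be run against an in-memory document. This is a sketch with made-up sample data; it uses `xml.etree.ElementTree` so that it also runs on Python 3, and it collects the values in a list (`locations`, an illustrative name) instead of printing them:

```python
import io
import xml.etree.ElementTree as etree

# Made-up sample document standing in for a large input file.
xml_bytes = (
    b"<builds>"
    b"<build_location>a</build_location>"
    b"<build_location>b</build_location>"
    b"</builds>"
)

context = iter(etree.iterparse(io.BytesIO(xml_bytes), events=("start", "end")))
_, root = next(context)  # the first event is the start of the root element

locations = []
for event, elem in context:
    if event == "end" and elem.tag == "build_location":
        locations.append(elem.text)
        root.clear()  # drop already-processed children to free memory

print(locations)
```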
jfs
  • Running into the following error: `for _, elem in etree.iterparse(file): File "", line 107, in next cElementTree.ParseError: no element found: line 1, column 0` – user1795998 Nov 04 '12 at 04:58
  • @user1795998: is the `file` empty? Update your question with the code that you use. – jfs Nov 04 '12 at 05:17
  • It's not empty. I am using 2.7.3; can you please suggest how to get it to work using 2.7.3? I am using import xml.dom.minidom as minidom – user1795998 Nov 04 '12 at 05:29
  • @user1795998: copy-paste any of the above examples as is and replace `sys.stdin` by a filename or a file object (such as returned by `open()` function). – jfs Nov 04 '12 at 05:32
  • @user1795998: yes. cElementTree is available since Python 2.5 at least. If you click any of the links in the answer you'll see that it works on Python 2.7 – jfs Nov 04 '12 at 05:38
  • @sebastian: I copied the example as is and am running into an error. I updated my question with the same – user1795998 Nov 04 '12 at 05:56
  • @user1795998: I've updated the answer. btw, don't replace entire question next time, just append the update at the end of your question. – jfs Nov 04 '12 at 06:20
0

I don't think the entire snippet above is completely helpful. However, UnicodeEncodeError usually happens when non-ASCII characters aren't handled properly, that is, when text is encoded with a codec that cannot represent them.

Example:

# source_encoding is the encoding of the input bytes, e.g. "utf-8"
unicode_str = html.decode(source_encoding)
encoded_str = unicode_str.encode("utf8")

It's already explained clearly in this answer: Python: Convert Unicode to ASCII without errors

This should at least solve the UnicodeEncodeError.
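
A small sketch of the decode-then-encode pattern described here. It assumes the input bytes are UTF-8, and the sample data is made up. The `"ignore"` error handler is one way to get ASCII output without an exception, at the cost of silently dropping characters:

```python
# Made-up input: UTF-8 bytes containing a bullet (U+2022).
raw = b"\xe2\x80\xa2 Please move to this build"

# Decode with the encoding the bytes were written in (assumed UTF-8 here).
text = raw.decode("utf-8")

# Encoding back to UTF-8 is lossless.
utf8_bytes = text.encode("utf-8")

# Encoding to ASCII with "ignore" drops the bullet instead of raising.
ascii_bytes = text.encode("ascii", "ignore")

print(ascii_bytes)
```

Unlike `xmlcharrefreplace`, the `"ignore"` handler loses information, so it only suits output where the dropped characters do not matter.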
