75

I want to parse an XML document. I have stored the document in the datastore using the model below:

from google.appengine.ext import db

class XMLdocs(db.Expando):
    id = db.IntegerProperty()
    name = db.StringProperty()
    content = db.BlobProperty()

Below is my code:

import StringIO
from xml.sax import make_parser

parser = make_parser()
curHandler = BasketBallHandler()
parser.setContentHandler(curHandler)
for q in XMLdocs.all():
    parser.parse(StringIO.StringIO(q.content))

I am getting the error below:

'ascii' codec can't encode character u'\xef' in position 0: ordinal not in range(128)
Traceback (most recent call last):  
  File "/base/python_runtime/python_lib/versions/1/google/appengine/ext/webapp/__init__.py", line 517, in __call__
    handler.post(*groups)   
  File "/base/data/home/apps/parsepython/1.348669006354245654/mapreduce/base_handler.py", line 59, in post
    self.handle()   
  File "/base/data/home/apps/parsepython/1.348669006354245654/mapreduce/handlers.py", line 168, in handle
    scan_aborted = not self.process_entity(entity, ctx)   
  File "/base/data/home/apps/parsepython/1.348669006354245654/mapreduce/handlers.py", line 233, in process_entity
    handler(entity)   
  File "/base/data/home/apps/parsepython/1.348669006354245654/parseXML.py", line 71, in process
    parser.parse(StringIO.StringIO(q.content))   
  File "/base/python_runtime/python_dist/lib/python2.5/xml/sax/expatreader.py", line 107, in parse
    xmlreader.IncrementalParser.parse(self, source)   
  File "/base/python_runtime/python_dist/lib/python2.5/xml/sax/xmlreader.py", line 123, in parse
    self.feed(buffer)  
  File "/base/python_runtime/python_dist/lib/python2.5/xml/sax/expatreader.py", line 207, in feed
    self._parser.Parse(data, isFinal)   
  File "/base/data/home/apps/parsepython/1.348669006354245654/parseXML.py", line 136, in characters   
    print ch   
UnicodeEncodeError: 'ascii' codec can't encode character u'\xef' in position 0: ordinal not in range(128)   
Vinay Sajip
mahesh
  • 2
    Your stacktrace shows that your executing code is different to what you pasted - and that you're using `print`. Don't use print in a WSGI app! – Nick Johnson Oct 17 '11 at 23:41

7 Answers

112

The actual best answer for this problem depends on your environment, specifically what encoding your terminal expects.

The quickest one-line solution is to encode everything you print to ASCII, which your terminal is almost certain to accept, while discarding characters that you cannot print:

print ch  # fails with UnicodeEncodeError
print ch.encode('ascii', 'ignore')  # drops non-ASCII characters

The better solution is to change your terminal's encoding to utf-8, and encode everything as utf-8 before printing. You should get in the habit of thinking about your unicode encoding EVERY time you print or read a string.
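The difference between the two approaches can be sketched as follows. This is a minimal illustration, not the asker's code: the sample string is hypothetical, and the snippet is written so that it runs on both Python 2 and Python 3.

```python
# Hypothetical sample text containing one non-ASCII character (U+00EF).
ch = u'na\xefve'

# Option 1: drop every character ASCII cannot represent (lossy).
ascii_bytes = ch.encode('ascii', 'ignore')

# Option 2: keep every character by encoding as UTF-8 (lossless).
utf8_bytes = ch.encode('utf-8')

# repr() shows the raw bytes without triggering another encode step.
print(repr(ascii_bytes))
print(repr(utf8_bytes))
```

The ASCII version silently loses the accented character, while the UTF-8 version can be decoded back to the original text.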

Triptych
  • 1
    In my case, I was printing a Twitter stream to a terminal, and it was working fine. Then I wanted to redirect the program's output to a file, and I started getting 'ascii' codec can't encode characters in position 32-36. Later, as in this answer, I used print tweet.encode("utf-8", "ignore"), and it all worked. – kommradHomer Mar 26 '14 at 12:29
57

Appending .encode('utf-8') to the unicode object you are printing will do the job in Python 2.

Nicole
30

It seems you are hitting a UTF-8 byte order mark (BOM). Try decoding the content to a unicode string with the BOM stripped out:

import codecs

content = unicode(q.content.strip(codecs.BOM_UTF8), 'utf-8')
parser.parse(StringIO.StringIO(content))

I used strip instead of lstrip because in your case you had multiple occurrences of the BOM, possibly due to concatenated file contents.
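As an alternative sketch (not part of the original answer), the standard `utf-8-sig` codec decodes UTF-8 and drops a single leading BOM if one is present. Note that `strip` treats its argument as a set of bytes rather than as a sequence, so it can also remove stray `\xef`, `\xbb`, or `\xbf` bytes at either end. The helper name `decode_without_bom` and the sample bytes are hypothetical.

```python
import codecs

def decode_without_bom(raw):
    """Decode UTF-8 bytes, dropping one leading BOM if present."""
    return raw.decode('utf-8-sig')

# Hypothetical blob content: a BOM followed by a tiny XML document.
raw = codecs.BOM_UTF8 + b'<team>Caf\xc3\xa9</team>'
text = decode_without_bom(raw)
```

This only handles a BOM at the very start of the data; BOMs embedded later in concatenated content would still need separate handling.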

Morgan Wilde
Tugrul Ates
  • I have done exactly as described in the answer but I still get the error. At first it was reported at position 0, as mentioned in the question, and now it is reported at position 5785, as mentioned in my previous comment. – mahesh Feb 28 '11 at 12:42
  • I recommend converting any string `s` which produces the error with `s = unicode(s.strip(codecs.BOM_UTF8), 'utf-8')`. `s` refers to the name of your strings. – Tugrul Ates Feb 28 '11 at 12:45
  • Try to replace `lstrip` with `strip`. – Tugrul Ates Feb 28 '11 at 12:50
  • I understand what you are suggesting and I had already done the same. The error in detail: 'ascii' codec can't encode character u'\xef' in position 5785: ordinal not in range(128) – mahesh Feb 28 '11 at 12:50
  • 1
    It's an encode error during the conversion of a unicode object to a string during printing. It won't contain a UTF-8 BOM, it can't be decoded back to unicode, and the error is because it contains non-ASCII characters - removing them would *break* the content, and the BOM is only one of them. – Rosh Oxymoron Feb 28 '11 at 12:50
30

This worked for me:

from django.utils.encoding import smart_str
content = smart_str(content)
Orlando Pozo
8

The problem according to your traceback is the print statement on line 136 of parseXML.py. Unfortunately you didn't see fit to post that part of your code, but I'm going to guess it is just there for debugging. If you change it to:

print repr(ch)

then you should at least see what you are trying to print.

Duncan
  • 2
    -1 for non-unicode solution to an obvious unicode encoding problem. – Triptych Feb 28 '11 at 18:11
  • 7
    The unicode encoding problem is with the print statement. Yes, there may be other issues but fixing the print to not crash is the immediate issue. – Duncan Feb 28 '11 at 18:16
7

The problem is that you're trying to print a unicode character to a terminal that may not support Unicode. You need to encode it with the 'replace' option before printing it, e.g. print ch.encode(sys.stdout.encoding, 'replace').
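A slightly more defensive sketch of the same idea is shown below. It assumes that `sys.stdout.encoding` may be `None` (which can happen when output is piped or redirected) and falls back to ASCII in that case. The helper name `encode_for_output` and its `encoding` parameter are hypothetical, introduced only for illustration.

```python
import sys

def encode_for_output(text, encoding=None):
    """Encode text for printing, replacing unencodable characters with '?'."""
    # Prefer an explicit encoding, then the stream's own, then plain ASCII.
    encoding = encoding or getattr(sys.stdout, 'encoding', None) or 'ascii'
    return text.encode(encoding, 'replace')

# With an ASCII-only target the character degrades to '?' instead of raising.
fallback = encode_for_output(u'\xef', 'ascii')
```

In Python 2 the result is a byte string that can be passed straight to `print` without raising `UnicodeEncodeError`.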

Rosh Oxymoron
  • Printing is not essential; the statement where I am actually getting the error is the parse statement. – mahesh Feb 28 '11 at 12:59
  • 3
    @Mahesh: It's YOUR code that's causing the problem, at line 136 of parseXML.py -- either fix it yourself, or show us that part of the code so we can help you. – John Machin Feb 28 '11 at 17:22
-1

An easy way to overcome this problem is to set your default encoding to utf8. The following is an example:

import sys

reload(sys)
sys.setdefaultencoding('utf8')
Hafiz Muhammad Shafiq