UnicodeEncodeError: 'ascii' codec can't encode character u'\xef' in position 0: ordinal not in range(128)

Question

I want to parse an XML document. I have stored the document in the datastore using the model below:

from google.appengine.ext import db

class XMLdocs(db.Expando):
    id = db.IntegerProperty()
    name = db.StringProperty()
    content = db.BlobProperty()

This is my parsing code:

import StringIO
from xml.sax import make_parser

parser = make_parser()
curHandler = BasketBallHandler()
parser.setContentHandler(curHandler)
for q in XMLdocs.all():
    parser.parse(StringIO.StringIO(q.content))

I am getting the error below:

'ascii' codec can't encode character u'\xef' in position 0: ordinal not in range(128)
Traceback (most recent call last):  
  File "/base/python_runtime/python_lib/versions/1/google/appengine/ext/webapp/__init__.py", line 517, in __call__
    handler.post(*groups)   
  File "/base/data/home/apps/parsepython/1.348669006354245654/mapreduce/base_handler.py", line 59, in post
    self.handle()   
  File "/base/data/home/apps/parsepython/1.348669006354245654/mapreduce/handlers.py", line 168, in handle
    scan_aborted = not self.process_entity(entity, ctx)   
  File "/base/data/home/apps/parsepython/1.348669006354245654/mapreduce/handlers.py", line 233, in process_entity
    handler(entity)   
  File "/base/data/home/apps/parsepython/1.348669006354245654/parseXML.py", line 71, in process
    parser.parse(StringIO.StringIO(q.content))   
  File "/base/python_runtime/python_dist/lib/python2.5/xml/sax/expatreader.py", line 107, in parse
    xmlreader.IncrementalParser.parse(self, source)   
  File "/base/python_runtime/python_dist/lib/python2.5/xml/sax/xmlreader.py", line 123, in parse
    self.feed(buffer)  
  File "/base/python_runtime/python_dist/lib/python2.5/xml/sax/expatreader.py", line 207, in feed
    self._parser.Parse(data, isFinal)   
  File "/base/data/home/apps/parsepython/1.348669006354245654/parseXML.py", line 136, in characters   
    print ch   
UnicodeEncodeError: 'ascii' codec can't encode character u'\xef' in position 0: ordinal not in range(128)

Your stacktrace shows that your executing code is different to what you pasted - and that you're using `print`. Don't use print in a WSGI app! — Nick Johnson, Oct 17 '11 at 23:41

score 112 · Answer 1 · answered Feb 28 '11 at 19:59

The actual best answer for this problem depends on your environment, specifically what encoding your terminal expects.

The quickest one-line solution is to encode everything you print to ASCII, which your terminal is almost certain to accept, while discarding characters that you cannot print:

print ch #fails
print ch.encode('ascii', 'ignore')

The better solution is to change your terminal's encoding to utf-8, and encode everything as utf-8 before printing. You should get in the habit of thinking about your unicode encoding EVERY time you print or read a string.
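To make the trade-off concrete, here is a minimal sketch comparing the two ASCII error handlers with UTF-8 encoding. The sample string and variable names are illustrative and not taken from the question. The u'' and b'' literals are used so that the snippet behaves the same on Python 2.7 and Python 3.

```python
# Illustrative sample text (not from the question): contains U+00EF and U+00E9.
ch = u"na\xefve caf\xe9"

# 'ignore' silently drops every character that ASCII cannot represent.
ascii_ignored = ch.encode("ascii", "ignore")

# 'replace' substitutes '?' so the loss stays visible in the output.
ascii_replaced = ch.encode("ascii", "replace")

# UTF-8 can represent every character, so nothing is lost.
utf8_bytes = ch.encode("utf-8")

print(repr(ascii_ignored))
print(repr(ascii_replaced))
print(repr(utf8_bytes))
```

Only the UTF-8 result can be decoded back to the original text, which is why it is the better choice whenever the terminal or file accepts it.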

In my case, I was printing a Twitter stream to a terminal and it was working fine. Then I wanted to redirect the program's output to a file, and I started getting 'ascii' codec can't encode characters in position 32-36. Later, as in this answer, I used print tweet.encode("utf-8", "ignore"), and it all worked. — kommradHomer, Mar 26 '14 at 12:29

score 57 · Answer 2 · answered Aug 16 '13 at 08:23

Appending .encode('utf-8') to the object will do the job in recent versions of Python.

answered Aug 16 '13 at 08:23

Nicole

What do you mean by "recent versions of Python"? Only `3.x`, or also `2.7`? – kramer65 Dec 08 '15 at 13:42

Python 2.7 is clearly recent since it's still in widespread use. – tmthyjames Feb 29 '16 at 21:18

Works for me on Python 2.7 – A Star Mar 09 '17 at 20:42

score 30 · Accepted Answer · edited Mar 09 '13 at 22:17

It seems you are hitting a UTF-8 byte order mark (BOM). Try decoding the content into a unicode string with the BOM stripped:

import codecs

content = unicode(q.content.strip(codecs.BOM_UTF8), 'utf-8')
parser.parse(StringIO.StringIO(content))

I used strip instead of lstrip because in your case you had multiple occurrences of the BOM, possibly due to concatenated file contents.
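One caveat worth checking: strip() treats its argument as a set of bytes rather than as a sequence, so it can also remove a trailing 0xEF, 0xBB or 0xBF that belongs to a real character. Below is a minimal sketch of an alternative that decodes first and then deletes the BOM character, assuming the blob is valid UTF-8. The helper name decode_without_boms and the sample data are hypothetical, not part of the question's code.

```python
import codecs

def decode_without_boms(raw):
    """Decode UTF-8 bytes and delete every U+FEFF (BOM) character.

    Hypothetical helper for illustration. Decoding first means that
    no multi-byte sequence is ever split.
    """
    return raw.decode("utf-8").replace(u"\ufeff", u"")

# Two documents, each starting with a BOM, concatenated into one blob.
part = codecs.BOM_UTF8 + u"<p>\xbb</p>".encode("utf-8")
blob = part + part

text = decode_without_boms(blob)
print(repr(text))

# For comparison: strip() removes any of the three BOM bytes from either
# end, so it cuts the trailing 0xBB off the two-byte character U+00BB.
damaged = u"\xbb".encode("utf-8").strip(codecs.BOM_UTF8)
print(repr(damaged))
```

Note that this sketch also deletes any U+FEFF used deliberately as a zero-width no-break space, which may or may not matter for your documents.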

edited Mar 09 '13 at 22:17

Morgan Wilde

answered Feb 28 '11 at 11:59

Tugrul Ates

I have done exactly as described in the answer but I still get the error. At first it was reported at position 0, as in the question; now it is reported at position 5785, as in my previous comment. – mahesh Feb 28 '11 at 12:42
I recommend converting any string `s` which produces the error with `s = unicode(s.strip(codecs.BOM_UTF8), 'utf-8')`. `s` refers to the name of your strings. – Tugrul Ates Feb 28 '11 at 12:45
Try to replace `lstrip` with `strip`. – Tugrul Ates Feb 28 '11 at 12:50
I understand what you are suggesting and I have done that. The error in detail: 'ascii' codec can't encode character u'\xef' in position 5785: ordinal not in range(128) – mahesh Feb 28 '11 at 12:50

It's an encode error during the conversion of a unicode object to a string during printing. It won't contain a UTF-8 BOM, it can't be decoded back to unicode, and the error is because it contains non-ASCII characters - removing them would *break* the content, and the BOM is only one of them. – Rosh Oxymoron Feb 28 '11 at 12:50

score 30 · Answer 4 · answered Oct 17 '11 at 22:43

This worked for me:

from django.utils.encoding import smart_str
content = smart_str(content)

answered Oct 17 '11 at 22:43

Orlando Pozo

score 8 · Answer 5 · answered Feb 28 '11 at 12:23

The problem according to your traceback is the print statement on line 136 of parseXML.py. Unfortunately you didn't see fit to post that part of your code, but I'm going to guess it is just there for debugging. If you change it to:

print repr(ch)

then you should at least see what you are trying to print.
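The reason this cannot raise the same error is that, on Python 2, repr() of a unicode object is pure ASCII, with non-ASCII characters written as escapes. On Python 3, repr() keeps printable non-ASCII characters, and ascii() gives the escaped form instead. Here is a small sketch that works on both; the name ascii_repr is made up for illustration.

```python
try:
    ascii_repr = ascii   # Python 3: escapes non-ASCII characters
except NameError:
    ascii_repr = repr    # Python 2: repr() of unicode is already ASCII-only

ch = u"\xef"             # the character named in the traceback
shown = ascii_repr(ch)   # a plain ASCII description of the text

print(shown)
```

Because the result contains only ASCII characters, printing it works whatever encoding the output stream uses.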

answered Feb 28 '11 at 12:23

Duncan

-1 for non-unicode solution to an obvious unicode encoding problem. – Triptych Feb 28 '11 at 18:11

The unicode encoding problem is with the print statement. Yes, there may be other issues but fixing the print to not crash is the immediate issue. – Duncan Feb 28 '11 at 18:16

score 7 · Answer 6 · answered Feb 28 '11 at 12:46

The problem is that you're trying to print a unicode character to a possibly non-unicode terminal. You need to encode it with the 'replace' option before printing it, e.g. print ch.encode(sys.stdout.encoding, 'replace').
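One thing to watch for: sys.stdout.encoding can be None, for example on Python 2 when output is redirected to a file or pipe, and passing None to encode() fails. The sketch below adds a fallback for that case. The helper name encode_for_stream and the choice of UTF-8 as the fallback are assumptions for illustration.

```python
import sys

def encode_for_stream(text, stream=None, fallback="utf-8"):
    """Encode text for a stream, replacing characters it cannot show.

    Hypothetical helper: uses the stream's declared encoding and falls
    back to `fallback` when the stream does not declare one.
    """
    if stream is None:
        stream = sys.stdout
    encoding = getattr(stream, "encoding", None) or fallback
    return text.encode(encoding, "replace")

data = encode_for_stream(u"\xef")
print(repr(data))
```

On Python 2 the returned byte string can be passed straight to print. On Python 3 it would be written to sys.stdout.buffer instead, since print() there expects text.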

answered Feb 28 '11 at 12:46

Rosh Oxymoron

Printing is not essential; the statement where I am getting the error is the parse statement. – mahesh Feb 28 '11 at 12:59

@Mahesh: It's YOUR code that's causing the problem, at line 136 of parseXML.py -- either fix it yourself, or show us that part of the code so we can help you. – John Machin Feb 28 '11 at 17:22

score -1 · Answer 7 · answered Feb 09 '17 at 06:56

An easy solution to overcome this problem is to set your default encoding to utf8. The following is an example:

import sys

reload(sys)
sys.setdefaultencoding('utf8')

answered Feb 09 '17 at 06:56

Hafiz Muhammad Shafiq

Do not do this. [why it breaks code](https://anonbadger.wordpress.com/2015/06/16/why-sys-setdefaultencoding-will-break-code/) – Mark Tolonen Mar 18 '17 at 03:13
Can you explain the reason? – Hafiz Muhammad Shafiq Mar 20 '17 at 02:33
There is a link in my comment that explains it. Essentially libraries expect the default of `ascii` to remain the default. It is why `setdefaultencoding` is not normally available without the `reload` trick. – Mark Tolonen Mar 20 '17 at 03:34
