0

I am trying to write the contents of a collection to a file. The collection contains special characters such as ¡, which cause a problem. For example, the collection has entries like:

{..., Name: ¡Hi!, ...}

When I try to write this to a file, I get the error:

UnicodeEncodeError: 'ascii' codec can't encode character u'\xa1' in position 0: ordinal not in range(128)

I have tried the solutions provided here, but in vain. It would be great if someone could help me with this :)

So the example goes like this:

I have a collection which has the following details

{ "_id":ObjectId("5428ead854fed46f5ec4a0c9"), 
   "author":null,
   "class":"culture",
   "created":1411967707.356593,
   "description":null,
   "id":"eba9b4e2-900f-4707-b57d-aa659cbd0ac9",
   "name":"¡Hola!",
   "reviews":[

   ],
   "screenshot_urls":[

   ]
}

Now I try to access the name entry from the collection by iterating over it, i.e.:

f = open("sample.txt","w");

for val in exampleCollection:
   f.write("%s"%str(exampleCollection[val]).encode("utf-8"))

f.close();
srajappa
  • Did you try the accepted answer in the link provided? – blackmamba Oct 23 '15 at 18:36
  • Can you show us more details about how you actually encode the collection and it's not working? – coneyhelixlake Oct 23 '15 at 18:36
  • Thanks for your feedback, I have edited the question and provided an example. - @blackmamba – srajappa Oct 23 '15 at 18:46
  • I have edited and added a code snippet for better understanding the question that I asked. Sorry about posting an ambiguous question. - @bourbaki4481472 – srajappa Oct 23 '15 at 18:48
  • Possible duplicate of [UnicodeEncodeError: 'ascii' codec can't encode character u'\xa0' in position 20: ordinal not in range(128)](http://stackoverflow.com/questions/9942594/unicodeencodeerror-ascii-codec-cant-encode-character-u-xa0-in-position-20) – Pavan Gupta Oct 23 '15 at 19:14

4 Answers

2

The easiest way to remove characters you don't want is to specify the characters you do.

>>> import string
>>> validchars = string.ascii_letters + string.digits + ' '
>>> s = '¡Hi there!'
>>> clean = ''.join(c for c in s if c in validchars)
>>> clean
'Hi there'

If some forms of punctuation are okay, add them to validchars.
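
As a minimal sketch of that, assuming the sentence punctuation !?., is acceptable (the keep_valid helper and its extra parameter are illustrative names, not part of the answer above):

```python
# -*- coding: utf-8 -*-
import string


def keep_valid(text, extra=u'!?.,'):
    """Return text with every character outside the whitelist removed."""
    # Letters, digits and the space come from the answer above;
    # `extra` is the assumed set of punctuation that is okay to keep.
    validchars = set(string.ascii_letters + string.digits + u' ' + extra)
    return u''.join(c for c in text if c in validchars)


clean = keep_valid(u'¡Hi there!')
print(clean)
```

Passing extra=u'' gives the same behaviour as the original validchars, so the whitelist can be widened or narrowed per call.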

BlivetWidget
1

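This removes every character in the string that is not valid ASCII. The example is Python 3; in Python 2, write the literal as u'¡Hola!' so that it is a unicode string, otherwise the call raises UnicodeDecodeError.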

>>> '¡Hola!'.encode('ascii', 'ignore').decode('ascii')
'Hola!'

Alternatively, you can write the file as UTF-8, which can represent nearly all characters on Earth.
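
A rough sketch of the UTF-8 route, assuming the values are unicode strings. The doc dictionary below is a hypothetical stand-in for one entry of the collection in the question, and the file is written to the temp directory only so that the example is self-contained:

```python
# -*- coding: utf-8 -*-
import io
import os
import tempfile

# Hypothetical stand-in for one document from the collection in the question.
doc = {u'class': u'culture', u'name': u'¡Hola!'}

path = os.path.join(tempfile.gettempdir(), 'sample.txt')

# io.open takes an encoding and accepts unicode text directly, so no
# manual str() or encode() calls are needed (Python 2.6+ and Python 3).
with io.open(path, 'w', encoding='utf-8') as f:
    for key in sorted(doc):
        f.write(u'%s: %s\n' % (key, doc[key]))

# Read the file back with the same encoding to check the round trip.
with io.open(path, encoding='utf-8') as f:
    written = f.read()
```

The file object does the encoding, which avoids the implicit ASCII conversion that str() performs on unicode values in Python 2.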

Zenadix
0

As one user posted on this page, you should take a look at the Unicode tutorial in the docs: https://docs.python.org/2/howto/unicode.html

What's happening is you're trying to use a character that's outside the ASCII range, which is a mere 128 symbols. There's a really great article on this I found a while back, which I'll try to find and post here.

Edit: ah, here it is: http://www.joelonsoftware.com/articles/Unicode.html
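
To make the 128-symbol limit concrete, here is a small sketch using the character from the error message in the question (the variable names are illustrative):

```python
# -*- coding: utf-8 -*-
char = u'\xa1'  # the inverted exclamation mark from the error message

code_point = ord(char)             # 161, i.e. 0xA1
fits_in_ascii = code_point < 128   # ASCII only covers code points 0..127

# Encoding to ASCII in the default 'strict' mode fails for this character.
try:
    char.encode('ascii')
    encode_error = None
except UnicodeEncodeError as exc:
    encode_error = exc

# UTF-8 can represent it, using two bytes.
utf8_bytes = char.encode('utf-8')

print(code_point, fits_in_ascii, encode_error is not None)
```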

McGlothlin
0

You're trying to convert unicode to ascii in "strict" mode:

>>> help(str.encode)
Help on method_descriptor:

encode(...)
    S.encode([encoding[,errors]]) -> object

    Encodes S using the codec registered for encoding. encoding defaults
    to the default encoding. errors may be given to set a different error
    handling scheme. Default is 'strict' meaning that encoding errors raise
    a UnicodeEncodeError. Other possible values are 'ignore', 'replace' and
    'xmlcharrefreplace' as well as any other name registered with
    codecs.register_error that is able to handle UnicodeEncodeErrors.

You probably want something like one of the following:

# -*- coding: utf-8 -*-
s = u'¡Hi there!'

print(s.encode('ascii', 'ignore'))             # drops the ¡
print(s.encode('ascii', 'replace'))            # replaces it with ?
print(s.encode('ascii', 'xmlcharrefreplace'))  # turns it into the XML entity &#161;
print(s.encode('ascii', 'strict'))             # raises UnicodeEncodeError (the default)
Pavan Gupta