Python 3 UnicodeEncodeError for characters and smileys in Tweets

Question

I'm working with the Twitter API: I fetch tweets that contain a specific word (right now it's 'flafel'). Everything works except for this tweet:

b'And when I\'m thinking about getting the chili sauce on my flafel and the waitress, a Pinay, tells me not to get it cos "hindi yan masarap."\xf0\x9f\x98\x82'

I use print ("Tweet info: {}".format(str(tweet.text).encode('utf-8').decode('utf-8'))) to see tweets, but this one gives me a UnicodeEncodeError every time. If I remove decode() from that line, as in print ("Tweet info: {}".format(str(tweet.text).encode('utf-8'))), I can see the tweet as shown above, but I want to convert the \xf0\x9f\x98\x82 part to a str. I tried everything, every combination of decode and encode. How can I solve this problem?

Edit: I went to that user's Twitter account to see what the non-ASCII part is, and it turns out it's a smiley: 😂

Is it possible to convert that smiley?

Edit 2: The code is:

...
...
api = tweepy.API(auth)
for tweet in tweepy.Cursor(api.search,
                           q = "flafel",
                           result_type = "recent",
                           include_entities = True,
                           lang = "en").items():

    print ("Tweet info: {}".format(str(tweet.text).encode('utf-8').decode('utf-8')))

Have you tried `print ("Tweet info: {}".format(tweet.text.decode('utf-8')))`? Considering `tweet.text` is returning the bytes object you've posted in the question. — Ashwini Chaudhary, May 30 '16 at 14:11
@AshwiniChaudhary All tweets are returned as str. I tried your suggestion and got: AttributeError: 'str' object has no attribute 'decode' — GLHF, May 30 '16 at 14:13
@AshwiniChaudhary The problem is that other tweets also have non-ASCII characters, and decode() converts them nicely; only this tweet fails. For example, some tweets contain ❤❤ ☀, and those are converted fine. — GLHF, May 30 '16 at 14:15
Can you post the actual content of `tweet.text` in the question body, because what you've posted in the question is a bytes object. — Ashwini Chaudhary, May 30 '16 at 14:31
`b'\xf0\x9f\x98\x82'` is the UTF-8 representation of ['FACE WITH TEARS OF JOY' (U+1F602)](http://www.fileformat.info/info/unicode/char/1f602/index.htm) . See [here](http://www.fileformat.info/info/unicode/char/1f602/fontsupport.htm) for fonts that support it. — PM 2Ring, May 30 '16 at 14:34
@PM2Ring Well, is it possible to convert it? At least convert smileys to something like ':D' or ';D' or ':(' — GLHF, May 30 '16 at 14:36
Let's try to fix your error first. I don't understand why you do `str(tweet.text).encode('utf-8').decode('utf-8')`. Can you post the output of `print(tweet.text.encode('unicode-escape'))` for that tweet? — PM 2Ring, May 30 '16 at 14:50
@PM2Ring Well that part turned to `"\\U0001f602'` which is face with tears of joy of Python version, I saw it from your comment. — GLHF, May 30 '16 at 14:52
Ok. But that doesn't explain why you're doing that weird `str(tweet.text).encode('utf-8').decode('utf-8')` thing. Isn't `tweet.text` already a Unicode string? You may find this article helpful: [Pragmatic Unicode](http://nedbatchelder.com/text/unipain.html), which was written by SO veteran Ned Batchelder. — PM 2Ring, May 30 '16 at 15:18
You can get the name of Unicode codepoints using the standard [unicodedata](https://docs.python.org/3/library/unicodedata.html#module-unicodedata) module. And it looks like you can convert emojis in various ways using [Emojipy](https://kaviraj.me/generating-pdf-containing-emoji-python), but I've never used it myself. — PM 2Ring, May 30 '16 at 15:18
Could you be using Windows? And could you show the full error message? — Serge Ballesta, May 30 '16 at 15:30
@SergeBallesta Here: `UnicodeEncodeError: 'UCS-2' codec can't encode characters in position 154-154: Non-BMP character not supported in Tk` — GLHF, May 30 '16 at 15:43
Just search the error message and you will find that your question is neither new nor better phrased than other questions before. — Ulrich Eckhardt, May 30 '16 at 15:49
@UlrichEckhardt Yeah I did lots of times, my question is new. — GLHF, May 30 '16 at 15:52
Well, if you strip off all the irrelevant stuff, like where you got the string from, you will find that the remainder isn't new. That would also get you closer to creating a minimal but complete example, without which your question is off topic. — Ulrich Eckhardt, May 30 '16 at 16:16
@UlrichEckhardt Still, that's your opinion. If you think so flag it instead of commenting. — GLHF, May 30 '16 at 16:22

Serge Ballesta · Accepted Answer · 2016-05-30T16:29:10.547

The problem can arise at the moment you try to display the Unicode character \U0001f602 on Windows. Python 3 has no trouble converting it from UTF-8 to a Unicode string and back again, but Windows is not able to display it.

I tried this piece of code in different ways on a Windows 7 box:

>>> b = b'And when I\'m thinking about getting the chili sauce on my flafel and the waitress, a Pinay, tells me not to get it cos "hindi yan masarap."\xf0\x9f\x98\x82'
>>> u = b.decode('utf8')
>>> u
'And when I\'m thinking about getting the chili sauce on my flafel and the waitress, a Pinay, tells me not to get it cos "hindi yan masarap."\U0001f602'
>>> print(u)

And here is what happened:

in IDLE (the Python GUI interpreter based on Tk), I got this error:

UnicodeEncodeError: 'UCS-2' codec can't encode characters in position 139-139: Non-BMP character not supported in Tk

in a console using a non unicode codepage I got this error:

UnicodeEncodeError: 'charmap' codec can't encode character '\U0001f602' in position 139: character maps to <undefined>

(for the attentive reader: BMP here means Basic Multilingual Plane)

in a console using utf-8 codepage (chcp 65001) I got no error but a weird display:

>>> u
'And when I\'m thinking about getting the chili sauce on my flafel and the waitr
ess, a Pinay, tells me not to get it cos "hindi yan masarap."ðŸ˜‚'
>>> print(u)
And when I'm thinking about getting the chili sauce on my flafel and the waitres
s, a Pinay, tells me not to get it cos "hindi yan masarap."ðŸ˜‚
>>>

My conclusion is that the error is not in the UTF-8 <-> Unicode conversion. Rather, it looks like the Windows Tk version does not support this character, nor does any console code page (except 65001, which simply displays the individual UTF-8 bytes!)

TL;DR: The problem is neither in core Python processing nor in the UTF-8 codec, but only in the system conversion that is used to display the character '\U0001f602'.
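To make the display step explicit, here is a minimal sketch (not part of the session above) that pre-encodes the text with the target codec and errors='backslashreplace', so unsupported characters come out as escapes instead of raising. The helper name to_displayable is made up for illustration, and it assumes that sys.stdout.encoding reflects what the console can show. IDLE's Tk limitation is not detected this way.

```python
import sys

def to_displayable(text, encoding=None):
    """Return text with characters the codec cannot encode shown as escapes."""
    # Assumption: fall back to stdout's encoding, or ASCII if it is unknown.
    encoding = encoding or sys.stdout.encoding or 'ascii'
    # Encode with escapes for unsupported characters, then decode back to str.
    return text.encode(encoding, errors='backslashreplace').decode(encoding)

u = 'hindi yan masarap."\U0001f602'
# cp1252 has no slot for the emoji, so it is shown as a \U0001f602 escape.
print(to_displayable(u, 'cp1252'))
# UTF-8 can encode every code point, so the text comes back unchanged.
unchanged = to_displayable(u, 'utf-8')
```

The result is always a str that the chosen codec can encode, so a print to a console using that codec should not raise UnicodeEncodeError.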

Fortunately, since core Python has no problem with it, you can easily replace the offending '\U0001f602' with, for example, ':D' using a plain str.replace (run after the code shown above):

>>> print (u.replace(U'\U0001f602', ':D'))

And when I'm thinking about getting the chili sauce on my flafel and the waitress, a Pinay, tells me not to get it cos "hindi yan masarap.":D
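If several emoji need their own replacement, one possible extension of the str.replace idea is a translation table. This is only a sketch: the mapping and the name emoticonize are hypothetical, and you would have to fill the table with the emoji you actually care about.

```python
# Hypothetical mapping: extend it with whatever emoji you want to handle.
EMOJI_TO_EMOTICON = str.maketrans({
    '\U0001f602': ':D',   # FACE WITH TEARS OF JOY
    '\U0001f609': ';)',   # WINKING FACE
    '\U0001f641': ':(',   # SLIGHTLY FROWNING FACE
})

def emoticonize(text):
    """Replace each mapped emoji with its ASCII emoticon."""
    return text.translate(EMOJI_TO_EMOTICON)

print(emoticonize('hindi yan masarap."\U0001f602'))
```

Characters that are not in the table pass through untouched, so this can be combined with a generic fallback for everything else outside the BMP.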

If you want special processing for all characters outside the BMP, it is enough to know that the highest code point in the BMP is 0xFFFF. So you could use code like this:

import io

def convert(t):
    with io.StringIO() as fd:
        for c in t:  # replace all chars outside BMP with a !
            # assigning to dummy hides write()'s return value in the REPL
            dummy = fd.write(c if ord(c) < 0x10000 else '!')
        return fd.getvalue()
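The same filter could also be written with a regular expression. This is an alternative sketch rather than the code I ran above; the names NON_BMP and replace_non_bmp are illustrative, and it assumes Python 3.3 or later, where every str is indexed by full code points.

```python
import re

# Matches any single code point above U+FFFF, i.e. outside the BMP.
NON_BMP = re.compile('[\U00010000-\U0010FFFF]')

def replace_non_bmp(text, replacement='!'):
    """Replace every character outside the BMP with the given string."""
    return NON_BMP.sub(replacement, text)

print(replace_non_bmp('hindi yan masarap."\U0001f602'))
```

Characters inside the BMP, such as ™ (U+2122), are left alone.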

I see, well is there a way to convert parts like `\xf0\x9f\x98\x82` to ':D' at least? It'll be much better than bytes. — GLHF, May 30 '16 at 15:52
I know that. I mean, is there a way to find the non-convertible part and convert it to, for example, ':D'? I can't define all the bytes and their meanings myself... There must be a way to find them automatically — GLHF, May 30 '16 at 15:59

score 1 · Answer 2 · answered May 30 '16 at 16:59

As I mentioned in the comments, you can get the names of Unicode codepoints using the standard unicodedata module. Here's a small demo:

import unicodedata as ud

test = ('And when I\'m thinking about getting the chili sauce on my flafel and the '
    'waitress, a Pinay, tells me not to get it cos "hindi yan masarap."\U0001F602')

def convert_special(c):
    if c > '\uffff':
        c = ':{}:'.format(ud.name(c).lower().replace(' ', '_')) 
    return c

def convert_string(s):
    return ''.join([convert_special(c) for c in s])

for s in (test, 'Some special symbols \U0001F30C, ©, ®, ™, \U0001F40D, \u2323'): 
    print('{}\n{}\n'.format(s.encode('unicode-escape'), convert_string(s)))

output

b'And when I\'m thinking about getting the chili sauce on my flafel and the waitress, a Pinay, tells me not to get it cos "hindi yan masarap."\\U0001f602'
And when I'm thinking about getting the chili sauce on my flafel and the waitress, a Pinay, tells me not to get it cos "hindi yan masarap.":face_with_tears_of_joy:

b'Some special symbols \\U0001f30c, \\xa9, \\xae, \\u2122, \\U0001f40d, \\u2323'
Some special symbols :milky_way:, ©, ®, ™, :snake:, ⌣

Another option is to test if a character is in the Unicode "Symbol_Other" category. We can do that by replacing the

if c > '\uffff':

test in convert_special with

if ud.category(c) == 'So':

When we do that, we get this output:

b'And when I\'m thinking about getting the chili sauce on my flafel and the waitress, a Pinay, tells me not to get it cos "hindi yan masarap."\\U0001f602'
And when I'm thinking about getting the chili sauce on my flafel and the waitress, a Pinay, tells me not to get it cos "hindi yan masarap.":face_with_tears_of_joy:

b'Some special symbols \\U0001f30c, \\xa9, \\xae, \\u2122, \\U0001f40d, \\u2323'
Some special symbols :milky_way:, :copyright_sign:, :registered_sign:, :trade_mark_sign:, :snake:, :smile:
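One caveat with the demo: ud.name raises ValueError for a code point that has no name (private-use characters, for instance) unless a default is passed. A hedged variant that tolerates those could look like the sketch below; the name convert_special_safe and the <U+XXXX> fallback format are my own choices for illustration.

```python
import unicodedata as ud

def convert_special_safe(c):
    """Like convert_special, but tolerates code points that have no name."""
    if c > '\uffff':
        # With a default, ud.name() returns it instead of raising ValueError.
        name = ud.name(c, '')
        if name:
            return ':{}:'.format(name.lower().replace(' ', '_'))
        # Illustrative fallback: show the code point in hex.
        return '<U+{:04X}>'.format(ord(c))
    return c

print(convert_special_safe('\U0001F602'))   # named emoji
print(convert_special_safe('\U000F0000'))   # private-use, has no name
```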
