
I have a unicode string retrieved from a webservice using the requests module, which contains the bytes of a binary document (PCL, as it happens). One of these bytes has the value 248, and attempting to base64 encode it leads to the following error:

In [68]: base64.b64encode(response_dict['content']+'\n')
---------------------------------------------------------------------------
UnicodeEncodeError                        Traceback (most recent call last)
C:\...\<ipython-input-68-8c1f1913eb52> in <module>()
----> 1 base64.b64encode(response_dict['content']+'\n')

C:\Python27\Lib\base64.pyc in b64encode(s, altchars)
     51     """
     52     # Strip off the trailing newline
---> 53     encoded = binascii.b2a_base64(s)[:-1]
     54     if altchars is not None:
     55         return _translate(encoded, {'+': altchars[0], '/': altchars[1]})

UnicodeEncodeError: 'ascii' codec can't encode character u'\xf8' in position 272: ordinal not in range(128)

In [69]: response_dict['content'].encode('base64')
---------------------------------------------------------------------------
UnicodeEncodeError                        Traceback (most recent call last)
C:\...\<ipython-input-69-7fd349f35f04> in <module>()
----> 1 response_dict['content'].encode('base64')

C:\...\base64_codec.pyc in base64_encode(input, errors)
     22     """
     23     assert errors == 'strict'
---> 24     output = base64.encodestring(input)
     25     return (output, len(input))
     26

C:\Python27\Lib\base64.pyc in encodestring(s)
    313     for i in range(0, len(s), MAXBINSIZE):
    314         chunk = s[i : i + MAXBINSIZE]
--> 315         pieces.append(binascii.b2a_base64(chunk))
    316     return "".join(pieces)
    317

UnicodeEncodeError: 'ascii' codec can't encode character u'\xf8' in position 44: ordinal not in range(128)

I find this slightly surprising, because 248 is within the range of an unsigned byte (and can be held in a byte string), but my real question is: what is the best or right way to encode this string?

My current work-around is this:

In [74]: byte_string = ''.join(map(compose(chr, ord), response_dict['content']))

In [75]: byte_string[272]
Out[75]: '\xf8'

This appears to work correctly, and the resulting byte_string is capable of being base64 encoded, but it seems like there should be a better way. Is there?

Marcin
  • 248 may be within the range of an unsigned byte, but it's not in the range of standardized ASCII [0-127]. – Cameron Mar 05 '12 at 19:00
  • @Cameron: A true and good point, but it still doesn't explain the problem, as the exact same value, when held in a byte string does not result in that error. – Marcin Mar 05 '12 at 19:03
  • See my answer :-) What you've done is take the codepoints of the `unicode` string and treat them as bytes. This is... fishy at best, since there's no guarantee the codepoints are even within the range 0-255. What's even worse is that nobody else will know how to interpret the byte string later on, since it's in a custom, undefined encoding. – Cameron Mar 05 '12 at 19:15
  • @Cameron: to reiterate: these data are not character code points, they are binary data. – Marcin Mar 05 '12 at 19:19

5 Answers


You have a unicode string which you want to base64 encode. The problem is that b64encode() only works on bytes, not characters. So, you need to transform your unicode string (which is a sequence of abstract Unicode codepoints) into a byte string.
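
A minimal sketch of why the call fails, assuming (as the tracebacks in the question suggest) that Python 2 implicitly ASCII-encodes a unicode string before base64 sees it. The explicit `encode('ascii')` below reproduces the same error, while the same value held in a byte string encodes without trouble. The variable names are illustrative only, and the snippet is written to run on Python 3 as well as 2.7.

```python
import base64

text = u'\xf8'  # a single code point, U+00F8 (decimal 248)

# Encoding to ASCII fails because 248 lies outside the range 0-127.
try:
    text.encode('ascii')
    ascii_failed = False
except UnicodeEncodeError:
    ascii_failed = True

# The same value as a byte string needs no character encoding at all.
encoded = base64.b64encode(b'\xf8')
```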

The mapping of abstract Unicode strings into a concrete series of bytes is called encoding. Python supports several encodings; I suggest the widely-used UTF-8 encoding:

byte_string = response_dict['content'].encode('utf-8')

Note that whoever is decoding the bytes will also need to know which encoding was used to get back a unicode string via the complementary decode() function:

# Decode
decoded = byte_string.decode('utf-8')
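
The two steps can be combined into a round trip. This is only a sketch under the assumption that the content really is text and that both sides agree on UTF-8; `original`, `wire` and `restored` are illustrative names, not part of any API.

```python
import base64

original = u'\xf8'

# Sender: text -> UTF-8 bytes -> base64
wire = base64.b64encode(original.encode('utf-8'))

# Receiver: base64 -> UTF-8 bytes -> text
restored = base64.b64decode(wire).decode('utf-8')
```

If the receiver decodes with a different encoding than the sender used, `restored` will generally not equal `original`.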

A good starting point for learning more about Unicode and encodings is the Python docs, and this article by Joel Spolsky.

Cameron
  • To be clear: the contents of my unicode string are binary data. I cannot change them to some different bytes. Is there an identity encoding? – Marcin Mar 05 '12 at 19:11
  • @Marcin: You cannot have a `unicode` string containing binary data. That's a contradiction in terms! If the `unicode` string's bytes are supposed to represent binary data (as seems to be the case here), then it shouldn't be stored in a `unicode` object as it's not really Unicode at all! – Cameron Mar 05 '12 at 19:31
  • Why not add a BOM? It actually helps with detecting whether a string is UTF-8 or not. – sebix Sep 03 '15 at 07:48
  • @sebix: I think it's best if BOMs are usually only used at the start of files; the overhead and complexity of having to check strings everywhere for a BOM seems too high. I got the encoding mixed up, though, the `-sig` one *does* add the BOM. – Cameron Sep 03 '15 at 11:46

I would suggest first encoding it to something like UTF-8 before base64 encoding:

In [12]: my_unicode = u'\xf8'

In [13]: my_utf8 = my_unicode.encode('utf-8')

In [15]: base64.b64encode(my_utf8)
Out[15]: 'w7g='
Simon Jagoe
  • *encoding to UTF-8* does not make sense. either you encode from UTF-8 to bytes/ascii or you decode from ascii to UTF-8. it's the other way round. – sebix Sep 03 '15 at 07:49

Since you are working with binary data, I'm not sure that it's a good idea to use the UTF-8 encoding. I guess it depends on how you intend to use the base64-encoded representation. I think it would probably be better if you can retrieve the data as a byte string and not a unicode string. I have never used the requests library, but browsing the documentation suggests that it is possible. There are sections talking about "Binary Response Content" and "Raw Response Content".
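
To illustrate the concern about UTF-8, here is a small sketch comparing two encodings on a made-up three-character string standing in for binary data (the sample value is an assumption, not data from the question). UTF-8 turns U+00F8 into two bytes, whereas latin-1 maps code points 0-255 one-to-one onto byte values, which is also what the `chr`/`ord` workaround in the question does.

```python
# Stand-in for binary data that ended up as code points in a unicode string.
text = u'\x1b\xf8\x00'

latin1_bytes = text.encode('latin-1')  # one byte per code point (0-255 only)
utf8_bytes = text.encode('utf-8')      # U+00F8 becomes the two bytes C3 B8

# Equivalent of the question's workaround: code point -> byte value.
workaround = bytes(bytearray(ord(ch) for ch in text))
```

Note that latin-1 only recovers the original bytes if the data was decoded as latin-1 in the first place; with any other decoding the code points need not match the original byte values.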

Dan Gerhardsson
  • Thanks! It turns out that encoding as latin-1 yields the exact same sequence of bytes as my workaround. – Marcin Mar 05 '12 at 19:23
  • @Marcin: You need to make sure that the requests module hasn't assumed that you are working with text, applied a default encoding, and decoded your binary data to unicode. If that is the case you've got trouble. Can you verify that the content is what you expect? – Dan Gerhardsson Mar 05 '12 at 19:31
  • Having paid a little bit more attention to the docs, it turns out that requests also tells me the encoding that is used to decode the response to unicode, so I can reliably always re-encode with that (and that once again yields the same bytes). – Marcin Mar 05 '12 at 19:38

It should be possible to get the response as binary bytes and skip the decoding and encoding steps entirely. There's always a possibility that requests will choose an encoding that loses some data or errors out in the round trip.

This part of the docs called "Binary Response Content" seems to fit your problem perfectly.
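
A rough sketch of that approach, assuming the document arrives as the body of a requests response: `response.content` holds the undecoded bytes, while `response.text` is the decoded unicode version. The helper name `b64_of_body` is hypothetical and not part of requests.

```python
import base64


def b64_of_body(response):
    """Base64-encode the raw, undecoded body of a requests-style response."""
    # .content is the byte string as received; no character decoding is applied,
    # so there is nothing to re-encode before calling b64encode().
    return base64.b64encode(response.content)
```

It would be used along the lines of `b64_of_body(requests.get(url))`, where `url` is whatever address the webservice exposes. If the bytes are instead embedded in a JSON string, as the `response_dict['content']` lookup in the question might suggest, this shortcut does not apply directly.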

Mark Ransom

If it's binary data, why encode and decode at all, especially the "base64.encodestring" part? Below is how I encode images into base64 so that I can embed them directly in my Python code instead of shipping extra files. This is on Python 2.7.2, by the way.

import base64

# Open in binary mode so the contents are read as a byte string
with open("blah.icon", "rb") as iconfile:
    icondata = iconfile.read()
icondata = base64.b64encode(icondata)