Current code:

 file.write("\"" + key + "\": " + "\"" + french[key].encode('utf8') + "\"" + ',' + '\n')

where the entries in the french dictionary look like this:

"YOU_HAVE_COMPLETED_ENROLLMENT": "Vous avez termin\u00e9 l'inscription !"

Getting this error:

UnicodeDecodeError: 'ascii' codec can't decode byte 0xc3 in position 19: ordinal not in range(128)

I have tried all the solutions on here, but none seem to work.

Justin
  • same error it looks like still – Justin Oct 04 '17 at 18:23
  • Possible duplicate of [python encoding utf-8](https://stackoverflow.com/questions/15092437/python-encoding-utf-8) – kaza Oct 04 '17 at 18:23
  • I looked at that thread and it says to just write it directly without encoding. However, if I remove .encode, it gives me an encode error: 'ascii' codec can't encode character u'\xe9' in position 37: ordinal not in range(128) – Justin Oct 04 '17 at 18:26
  • https://stackoverflow.com/questions/18403898/unicodedecodeerror-utf8-codec-cant-decode-byte-0xc3 – kaza Oct 04 '17 at 18:30
  • @Justin. (1) Are you using python2 or python 3? (2) what is the output of `print(type(french[key]))` (3) what is `file`, and how did you create it? – ekhumoro Oct 04 '17 at 18:59
  • @ekhumoro (1) On mac, getting python version 2.7. (2) output of print returns (3) file is created through this line: file = open(os.path.join(fr_directory, 'strings.json'), 'w+') – Justin Oct 04 '17 at 19:04
  • @Justin. I have tried the code in your question and it does not produce any errors when using python-2.7. Please check it and make sure you copy/paste the actual code that is causing the problem. – ekhumoro Oct 04 '17 at 19:18
  • Possible duplicate of [Python: special characters giving me problems (from PDFminer)](https://stackoverflow.com/questions/6870214/python-special-characters-giving-me-problems-from-pdfminer) – Maxim Egorushkin Oct 04 '17 at 20:36

2 Answers

The solution: Concatenate unicode strings before encoding, then encode the complete string just before writing to a file. The codecs library simplifies this for you.

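import codecs
import os

# codecs.open returns a file object that encodes unicode to UTF-8 on write
file = codecs.open(os.path.join(fr_directory, 'strings.json'), 'w+', encoding='utf8')
# No .encode() here: concatenate unicode and let the file object do the encoding
file.write("\"" + key + "\": " + "\"" + french[key] + "\"" + ',' + '\n')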

I have opened the file with codecs.open rather than just open, specifying that the file should automatically handle encoding into UTF-8 when you write unicode strings. I have also removed the explicit encoding call you used.
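As a minimal sketch of the same idea applied to the whole dictionary, the snippet below uses io.open instead of codecs.open. That swap is my assumption, not part of the original answer: io.open also accepts an encoding argument and behaves the same way on Python 2.6+ and Python 3. The helper name write_entries and the temporary output path are hypothetical.

```python
# -*- coding: utf-8 -*-
import io
import os
import tempfile


def write_entries(path, entries):
    """Write each pair as a '"key": "value",' line, encoded as UTF-8."""
    # The text-mode file object encodes unicode to UTF-8 on every write,
    # so nothing is encoded by hand here.
    with io.open(path, "w", encoding="utf-8") as handle:
        for key in sorted(entries):
            handle.write(u'"%s": "%s",\n' % (key, entries[key]))


# Sample data taken from the question; the output location is a temp dir.
french = {
    u"YOU_HAVE_COMPLETED_ENROLLMENT": u"Vous avez termin\u00e9 l'inscription !",
}
out_path = os.path.join(tempfile.mkdtemp(), "strings.json")
write_entries(out_path, french)

# Read the raw bytes back to confirm what landed on disk.
with open(out_path, "rb") as handle:
    written = handle.read()
```

The key point is unchanged: build the line entirely from unicode strings and let the file object perform the single encoding step.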

Further explanation:

The keys and values of your dictionary are almost certainly Unicode strings. A Unicode string must be encoded to bytes before it can be written to a file. In Python 2, most operations assume ASCII unless told otherwise, and the file objects returned by open are among them. That is why writing a Unicode string that contains non-ASCII characters to such a file raises an exception:

>>> with open('/tmp/test.txt', 'w') as f:
...    f.write(u"Vous avez termin\xe9 l'inscription !")
UnicodeEncodeError: 'ascii' codec can't encode character u'\xe9' in position 16: ordinal not in range(128)

This error is one that you can fix by encoding the string directly, so this works:

>>> with open('/tmp/test.txt', 'w') as f:
...    f.write(u"Vous avez termin\xe9 l'inscription !".encode('utf-8'))

However, this alone does not solve your problem, because you are building a more complicated string. When you concatenate a Unicode string with a UTF-8 encoded byte string, Python 2 first tries to decode the byte string as ASCII, so you also get an exception, even when not writing to a file:

>>> u"YOU_HAVE_COMPLETED_ENROLLMENT: " + u"Vous avez termin\xe9 l'inscription !".encode('utf-8')
UnicodeDecodeError: 'ascii' codec can't decode byte 0xc3 in position 16: ordinal not in range(128)
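The implicit step behind that traceback can be made visible by doing the ASCII decode explicitly. This is only an illustrative sketch (the variable names are mine); it runs on both Python 2 and Python 3, whereas the implicit conversion itself exists only in Python 2.

```python
# -*- coding: utf-8 -*-
prefix = u"YOU_HAVE_COMPLETED_ENROLLMENT: "
# Encoding turns the single character e-acute into two bytes: 0xc3 0xa9.
encoded = u"Vous avez termin\xe9 l'inscription !".encode("utf-8")

# Python 2 does roughly this behind the scenes when unicode + bytes are mixed.
error = None
try:
    encoded.decode("ascii")
except UnicodeDecodeError as exc:
    error = exc  # byte 0xc3 is outside the ASCII range

# Decoding with the codec that was actually used avoids the failure.
combined = prefix + encoded.decode("utf-8")
```

The failing byte is the first byte of the UTF-8 encoding of é, which is why the error mentions 0xc3 rather than 0xe9.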

You can fix this by not encoding either string:

>>> u"YOU_HAVE_COMPLETED_ENROLLMENT: " + u"Vous avez termin\xe9 l'inscription !"
u"YOU_HAVE_COMPLETED_ENROLLMENT: Vous avez termin\xe9 l'inscription !"

But then, when you want to write it to a file, you have to encode the whole string:

>>> with open('/tmp/test.txt', 'w') as f:
...    line = u"YOU_HAVE_COMPLETED_ENROLLMENT: " + u"Vous avez termin\xe9 l'inscription !"
...    f.write(line.encode('utf-8'))

For convenience, the codecs module lets you avoid encoding by hand on every write:

>>> import codecs
>>> with codecs.open('/tmp/test.txt', 'w', encoding='utf8') as f:
...    f.write(u"YOU_HAVE_COMPLETED_ENROLLMENT: " + u"Vous avez termin\xe9 l'inscription !")
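Since the target file is strings.json, another option worth considering is to let the json module produce the text instead of assembling quotes and commas by hand. This is a different technique from the one described above, offered only as a sketch: json.dumps with ensure_ascii=False keeps é as a literal character, and io.open handles the UTF-8 encoding. The output path is a hypothetical temporary location.

```python
# -*- coding: utf-8 -*-
import io
import json
import os
import tempfile

french = {
    u"YOU_HAVE_COMPLETED_ENROLLMENT": u"Vous avez termin\u00e9 l'inscription !",
}

# ensure_ascii=False keeps non-ASCII characters instead of \uXXXX escapes.
text = json.dumps(french, ensure_ascii=False, indent=2, sort_keys=True)
# On Python 2 the result can be a byte string when all input is ASCII.
if isinstance(text, bytes):
    text = text.decode("utf-8")

json_path = os.path.join(tempfile.mkdtemp(), "strings.json")
with io.open(json_path, "w", encoding="utf-8") as handle:
    handle.write(text)

# Round-trip check: parse the file again and keep the result.
with io.open(json_path, "r", encoding="utf-8") as handle:
    loaded = json.load(handle)
```

Unlike the hand-built lines, this also escapes quotes and backslashes inside the values and does not leave a trailing comma after the last entry.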
user108471

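You could decode byte strings to Unicode using this function. Note that the "ignore" error handler silently drops any bytes that are not valid UTF-8.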

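def _parse_value(value):
    # Only byte strings (Python 2 str) need decoding; unicode passes through
    if isinstance(value, str):
        # "ignore" drops bytes that are not valid UTF-8
        value = value.decode("utf-8", "ignore").strip()
    return value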
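To see what the "ignore" handler does in practice, here is a small variant that checks for bytes (an alias for str on Python 2) so that it also runs on Python 3. The errors parameter and the sample inputs are my additions for illustration.

```python
# -*- coding: utf-8 -*-
def parse_value(value, errors="ignore"):
    """Decode byte strings as UTF-8 and strip whitespace; pass others through."""
    if isinstance(value, bytes):
        value = value.decode("utf-8", errors).strip()
    return value


# Valid UTF-8 decodes cleanly and the trailing space is stripped.
clean = parse_value(b"termin\xc3\xa9 ")

# 0xe9 is e-acute in Latin-1 but not valid UTF-8 here, so "ignore" drops it.
lossy = parse_value(b"termin\xe9")

# With errors="strict" the same input raises instead of losing data.
strict_failed = False
try:
    parse_value(b"termin\xe9", errors="strict")
except UnicodeDecodeError:
    strict_failed = True
```

If silently losing characters is not acceptable, "strict" (the default for decode) surfaces wrongly encoded input early.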
rachid el kedmiri