Encoding issue for Python tool Unidecode on CL

Question

I need to convert Unicode files to ASCII. If a letter doesn't exist in ASCII, it should be converted to its closest ASCII representation. I'm using the Unidecode tool for this (https://pypi.python.org/pypi/Unidecode). It works fine in the interactive Python interpreter on the command line (that is, when I start python, import the library, and print the decoded word like this: print unidecode(u'äèß')).

Unfortunately, when I try to use this tool directly on the command line, with something like python -c "from unidecode import *; print unidecode(u'äèß')", it only prints gibberish (A$?A"A to be exact, even though it should have printed aess, as it did in the interpreter). I don't know how to solve this. I thought it might be caused by my terminal not being set to UTF-8. However, running locale in my terminal printed the following output:

LANG="de_DE.UTF-8"
LC_COLLATE="de_DE.UTF-8"
LC_CTYPE="de_DE.UTF-8"
LC_MESSAGES="de_DE.UTF-8"
LC_MONETARY="de_DE.UTF-8"
LC_NUMERIC="de_DE.UTF-8"
LC_TIME="de_DE.UTF-8"
LC_ALL="de_DE.UTF-8"

Or could Python have problems with the stdin encoding on the command line? It gave me the correct output in the interactive interpreter, but not when invoked with python -c.

Do you guys have an idea?

score 0 · Answer 1 · edited May 23 '17 at 12:11

If you put the code in a file and run it, Python reports a syntax error:

#!/bin/python
from unidecode import *
print unidecode(u'äèß')

[Wani@Linux tmp]$ python tmp.py 
File "tmp.py", line 3
SyntaxError: Non-ASCII character '\xc3' in file tmp.py on line 3, but no encoding declared; see http://www.python.org/peps/pep-0263.html for details
[Wani@Linux tmp]$

To fix this, declare the source encoding:

#!/bin/python
#coding: utf8
from unidecode import *; print unidecode(u'äèß')

[Wani@Linux tmp]$ python tmp.py
aess
[Wani@Linux tmp]$
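
Why the declaration matters can be checked by compiling the same source bytes under two different declarations. The following is a minimal sketch, assuming Python 3 (where compile() accepts a byte string and honors the PEP 263 coding line). The helper name literal_value is made up for this illustration and is not part of Unidecode.

```python
# Raw UTF-8 bytes for the three characters, as a UTF-8 terminal or editor
# would store them inside the string literal.
RAW = b"\xc3\xa4\xc3\xa8\xc3\x9f"


def literal_value(coding):
    """Compile `s = u'<RAW>'` under the given coding line and return s.

    Hypothetical helper, for illustration only.
    """
    source = b"# coding: " + coding.encode("ascii") + b"\ns = u'" + RAW + b"'\n"
    namespace = {}
    exec(compile(source, "<demo>", "exec"), namespace)
    return namespace["s"]


as_utf8 = literal_value("utf-8")      # bytes decoded as UTF-8
as_latin1 = literal_value("latin-1")  # one character per byte

print(len(as_utf8), len(as_latin1))
```

With the UTF-8 declaration the literal holds the three intended characters. With a single-byte encoding such as latin-1 it holds six, which is the kind of input that makes Unidecode print gibberish.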

So, from the command line, you need to call it like this:

[Wani@Linux tmp]$ python -c "#coding: utf8
from unidecode import *; print unidecode(u'äèß')"
aess
[Wani@Linux tmp]$ python -c "$(echo -e "#coding: utf8\nfrom unidecode import *; print unidecode(u'äèß')")"
aess
[Wani@Linux tmp]$

Further Reading: Correct way to define Python source code encoding

I didn't know I was able to do this on two lines, cool! For my application, unutbu's solution seems to work, though. — conipo, Feb 02 '14 at 14:27

unutbu · Accepted Answer · 2014-02-02T13:25:40.457

When you type 'äèß' in the terminal, although you see 'äèß', the terminal sees bytes. If your terminal encoding is utf-8, then it sees the bytes

In [2]: 'äèß'
Out[2]: '\xc3\xa4\xc3\xa8\xc3\x9f'

So when you type

python -c "from unidecode import *; print unidecode(u'äèß')"

at the command line, the terminal (assuming utf-8 encoding) sees

python -c "from unidecode import *; print unidecode(u'\xc3\xa4\xc3\xa8\xc3\x9f')"

That's not the unicode you intended to send to Python.

In [28]: print(u'\xc3\xa4\xc3\xa8\xc3\x9f')
Ã¤Ã¨Ã
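
The same mismatch can be reproduced without a terminal. This is a minimal sketch that should run on both Python 2.7 and Python 3 (the variable names are only for illustration). It encodes the three characters as UTF-8 and then reads each byte as if it were a code point, which is what the u'...' literal above ends up containing:

```python
intended = u'\xe4\xe8\xdf'               # the text the user meant to pass
utf8_bytes = intended.encode('utf-8')    # what a UTF-8 terminal actually sends
misread = utf8_bytes.decode('latin-1')   # one code point per byte

# The damage is reversible as long as no byte was dropped along the way.
repaired = misread.encode('latin-1').decode('utf-8')

print(repr(utf8_bytes))
print('%d characters intended, %d after misreading' % (len(intended), len(misread)))
print(repaired == intended)
```

Three characters become six because each of them takes two bytes in UTF-8.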

There are a number of ways to work around the problem, perhaps in order of convenience:

1. Let the terminal change äèß to \xc3\xa4\xc3\xa8\xc3\x9f and then decode it as utf-8:

% python -c "from unidecode import *; print unidecode('äèß'.decode('utf_8'))"
aess

2. Declare an encoding as shown in Nehal J. Wani's solution:
```
% python -c "#coding: utf8
> from unidecode import *; print unidecode(u'äèß')"
aess
```
This requires writing the command on two lines, however.

3. Since u'äèß' is equivalent to u'\xe4\xe8\xdf', you could avoid the problem by passing u'\xe4\xe8\xdf' instead:
```
% python -c "from unidecode import *; print unidecode(u'\xe4\xe8\xdf')"
aess
```
The problem with doing it this way (obviously) is that you have to figure out the hexadecimal code point values.

4. Or, you could specify the characters by their Unicode names:

% python -c "from unidecode import *; print unidecode(u'\N{LATIN SMALL LETTER A WITH DIAERESIS}\N{LATIN SMALL LETTER E WITH GRAVE}\N{LATIN SMALL LETTER SHARP S}')"
aess
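
For the last two options (hexadecimal code points and Unicode names), the escapes do not have to be looked up by hand. Below is a rough sketch using only the standard library. The helper names hex_escapes and name_escapes are hypothetical, and the code assumes it is given real Unicode text (a str in Python 3, or a unicode object in Python 2.7):

```python
import unicodedata


def hex_escapes(text):
    """Rewrite text using \\x, \\u or \\U escapes (hypothetical helper)."""
    parts = []
    for ch in text:
        cp = ord(ch)
        if cp < 0x100:
            parts.append('\\x%02x' % cp)
        elif cp < 0x10000:
            parts.append('\\u%04x' % cp)
        else:
            parts.append('\\U%08x' % cp)
    return ''.join(parts)


def name_escapes(text):
    """Rewrite text using \\N{...} escapes (hypothetical helper).

    unicodedata.name raises ValueError for characters that have no name.
    """
    return ''.join('\\N{%s}' % unicodedata.name(ch) for ch in text)


sample = u'\xe4\xe8\xdf'
print("u'%s'" % hex_escapes(sample))
print("u'%s'" % name_escapes(sample))
```

The printed literals contain only ASCII, so they can be pasted into a python -c command without depending on the terminal encoding.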

As I will integrate it in a script that does the conversion for a lot of files, 1 seems to be a really good solution for me! — conipo, Feb 02 '14 at 14:25
