
I need to convert Unicode files to ASCII. If a letter doesn't exist in ASCII, it should be converted to its closest ASCII representation. I'm using the Unidecode tool for this (https://pypi.python.org/pypi/Unidecode). It works fine in the interactive Python interpreter on the command line, that is, when I start python, import the library, and print the decoded word like this: print unidecode(u'äèß').

Unfortunately, when I try to use the tool directly on the command line, with something like python -c "from unidecode import *; print unidecode(u'äèß')", it only prints gibberish: A$?A"A, to be exact, even though it should have printed aess (as it did in the interpreter). I don't know how to solve this. I thought it might be an encoding problem with my Terminal, such as it not being set to UTF-8. However, running locale in my Terminal printed the following output:

LANG="de_DE.UTF-8"
LC_COLLATE="de_DE.UTF-8"
LC_CTYPE="de_DE.UTF-8"
LC_MESSAGES="de_DE.UTF-8"
LC_MONETARY="de_DE.UTF-8"
LC_NUMERIC="de_DE.UTF-8"
LC_TIME="de_DE.UTF-8"
LC_ALL="de_DE.UTF-8"

Or could it be that Python has problems with the stdin encoding on the command line? It gave me the correct output in the interactive interpreter, but not when invoked with python -c.

Do you guys have an idea?

conipo

2 Answers


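If you save the code in a file, say tmp.py, and run it, Python 2 rejects it because the file contains non-ASCII bytes but declares no source encoding: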

#!/bin/python
from unidecode import *
print unidecode(u'äèß')

[Wani@Linux tmp]$ python tmp.py 
File "tmp.py", line 1
SyntaxError: Non-ASCII character '\xc3' in file tmp.py on line 1, but no encoding declared; see http://www.python.org/peps/pep-0263.html for details
[Wani@Linux tmp]$ 

To fix this, declare the source encoding near the top of the file:

#!/bin/python
#coding: utf8
from unidecode import *; print unidecode(u'äèß')

[Wani@Linux tmp]$ python tmp.py
aeess
[Wani@Linux tmp]$

So, on the command line, the code passed to -c needs the same encoding declaration as its first line:

[Wani@Linux tmp]$ python -c "#coding: utf8
from unidecode import *; print unidecode(u'äèß')"
aeess
[Wani@Linux tmp]$ python -c "$(echo -e "#coding: utf8\nfrom unidecode import *; print unidecode(u'äèß')")"
aeess
[Wani@Linux tmp]$
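
The effect of the declaration can be sketched in isolation. The snippet below is only an illustration written for Python 3 (the thread itself uses Python 2); the helper name run_with_declared_encoding is made up for this example. It compiles the same source bytes under two different PEP 263 declarations and shows that the resulting string differs:

```python
# Python 3 sketch: a coding declaration decides how source *bytes* become text.
# The literal below contains the two UTF-8 bytes of U+00E4 (a with diaeresis).
LITERAL = b"s = '" + "\u00e4".encode("utf-8") + b"'"


def run_with_declared_encoding(encoding):
    """Compile the same literal under a given '# coding:' line (hypothetical helper)."""
    source = b"# coding: " + encoding.encode("ascii") + b"\n" + LITERAL
    namespace = {}
    exec(compile(source, "<demo>", "exec"), namespace)
    return namespace["s"]


as_utf8 = run_with_declared_encoding("utf-8")      # bytes read as UTF-8
as_latin1 = run_with_declared_encoding("latin-1")  # same bytes read as Latin-1

print(len(as_utf8), len(as_latin1))  # 1 2
```

With the utf-8 declaration the two bytes become one character; with latin-1 they become two, which is the same kind of misreading that produced the gibberish in the question.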

Further Reading: Correct way to define Python source code encoding

Nehal J Wani
  • I didn't know I was able to do this on two lines, cool! For my application, unutbu's solution seems to work, though. – conipo Feb 02 '14 at 14:27

When you type 'äèß' in the terminal, you see 'äèß', but the terminal passes along bytes. If your terminal encoding is utf-8, those bytes are:

In [2]: 'äèß'
Out[2]: '\xc3\xa4\xc3\xa8\xc3\x9f'

So when you type

python -c "from unidecode import *; print unidecode(u'äèß')"

at the command line, the terminal (assuming utf-8 encoding) sees

python -c "from unidecode import *; print unidecode(u'\xc3\xa4\xc3\xa8\xc3\x9f')"

That's not the unicode you intended to send to Python.

In [28]: print(u'\xc3\xa4\xc3\xa8\xc3\x9f')
Ã¤Ã¨Ã
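
The same round trip can be reproduced without a terminal. This is a minimal sketch in Python 3 syntax (an assumption, since the sessions above are Python 2); it encodes the three characters as UTF-8 and then misreads each byte as its own code point, which is what the u'...' literal ends up containing:

```python
# Python 3 sketch of the byte-level view described above.
text = "\u00e4\u00e8\u00df"  # the three characters: a-diaeresis, e-grave, sharp s

utf8_bytes = text.encode("utf-8")
print(utf8_bytes)  # b'\xc3\xa4\xc3\xa8\xc3\x9f'

# Treating every byte as a separate code point (Latin-1) yields six characters.
misread = utf8_bytes.decode("latin-1")
print(ascii(misread))            # '\xc3\xa4\xc3\xa8\xc3\x9f'
print(len(text), len(misread))   # 3 6

# Reversing the misreading recovers the intended text.
repaired = misread.encode("latin-1").decode("utf-8")
print(repaired == text)  # True
```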

There are a number of ways to work around the problem, perhaps in order of convenience:

  1. Let the terminal change äèß to \xc3\xa4\xc3\xa8\xc3\x9f and then decode it as utf-8:

    % python -c "from unidecode import *; print unidecode('äèß'.decode('utf_8'))"
    aess
    
  2. Declare an encoding as shown in Nehal J. Wani's solution:

    % python -c "#coding: utf8
    > from unidecode import *; print unidecode(u'äèß')" 
    aess
    

    This requires writing the command on two lines, however.

  3. Since u'äèß' is equivalent to u'\xe4\xe8\xdf', you could avoid the problem by passing u'\xe4\xe8\xdf' instead:

    % python -c "from unidecode import *; print unidecode(u'\xe4\xe8\xdf')"
    aess
    

    The obvious drawback of this approach is that you have to look up the hexadecimal code point values.

  4. Or, you could specify the Unicode characters by name:

    % python -c "from unidecode import *; print unidecode(u'\N{LATIN SMALL LETTER A WITH DIAERESIS}\N{LATIN SMALL LETTER E WITH GRAVE}\N{LATIN SMALL LETTER SHARP S}')"
    aess
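
As a quick check that these spellings all denote the same string, and as one way to look up the code points and names needed for options 3 and 4, here is a small sketch using only the standard library. It is written for Python 3 (an assumption; the commands above are Python 2), and the helper name describe is invented for this example:

```python
# Python 3 sketch: three spellings of the same three-character string.
import unicodedata

from_utf8 = b"\xc3\xa4\xc3\xa8\xc3\x9f".decode("utf-8")  # option 1: decode bytes
from_hex = "\xe4\xe8\xdf"                                # option 3: code points
from_names = (
    "\N{LATIN SMALL LETTER A WITH DIAERESIS}"
    "\N{LATIN SMALL LETTER E WITH GRAVE}"
    "\N{LATIN SMALL LETTER SHARP S}"
)                                                        # option 4: names

print(from_utf8 == from_hex == from_names)  # True


def describe(text):
    """Return (code point, official name) for each character (hypothetical helper)."""
    return [("U+%04X" % ord(ch), unicodedata.name(ch)) for ch in text]


for code_point, name in describe(from_hex):
    print(code_point, name)
```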
    
unutbu
  • As I will integrate it in a script that does the conversion for a lot of files, 1 seems to be a really good solution for me! – conipo Feb 02 '14 at 14:25
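
For the batch use case mentioned in the comment, decoding explicitly when reading each file avoids depending on the terminal at all. The following is only a sketch under assumptions: it uses Python 3 syntax, the function convert_file and the file names are made up, and the transliteration function is passed in as a parameter so the demo runs without Unidecode installed (fake_transliterate is a stand-in, not Unidecode's behavior):

```python
# Python 3 sketch: convert one UTF-8 text file to ASCII with a given transliterator.
import io
import os
import tempfile


def convert_file(src_path, dst_path, transliterate, encoding="utf-8"):
    """Read src_path as `encoding`, transliterate, and write dst_path as ASCII."""
    with io.open(src_path, "r", encoding=encoding) as src:
        text = src.read()
    # Writing with encoding="ascii" raises UnicodeEncodeError if anything
    # non-ASCII survives the transliteration step.
    with io.open(dst_path, "w", encoding="ascii") as dst:
        dst.write(transliterate(text))


def fake_transliterate(text):
    """Stand-in for demonstration only; handles just one character."""
    return text.replace("\u00e4", "a")


workdir = tempfile.mkdtemp()
src_path = os.path.join(workdir, "in.txt")
dst_path = os.path.join(workdir, "out.txt")

with io.open(src_path, "w", encoding="utf-8") as handle:
    handle.write("\u00e4b")

convert_file(src_path, dst_path, fake_transliterate)

with io.open(dst_path, "r", encoding="ascii") as handle:
    result = handle.read()
print(result)  # ab
```

In real use one would presumably pass Unidecode's function instead, for example: from unidecode import unidecode, then convert_file("in.txt", "out.txt", unidecode).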