I have the text Aur\xc3\xa9lien and want to decode it with Python 3.8.

I tried the following:

import codecs
s = "Aur\xc3\xa9lien"
codecs.decode(s, "utf-8")
codecs.decode(bytes(s), "utf-8")
codecs.decode(bytes(s, "utf-8"), "utf-8")

but none of them gives the correct result, Aurélien.

How do I do it correctly?

Also, is there a simple, authoritative page that describes all these encodings for Python?

Alex
  • `s = "Aur\xc3\xa9lien"; b = bytes(s, 'latin-1'); print(b.decode('utf-8'))` – Justin Ezequiel Feb 04 '21 at 15:23
  • Note: your `s` is not really a text string but a sequence of bytes, so you should prefix the literal with a `b`. You are relying on a special feature of Python, which allows binary characters alongside Unicode sequences in a string. – Giacomo Catenazzi Feb 04 '21 at 15:48
  • I read that string from a file. How to precede an existing string with a 'b'? – Alex Feb 04 '21 at 15:51
  • How do you read the string from the file? You are probably using the wrong `open` call. Which parameters do you use? Usually `open` reads a text file and gives you Unicode strings (possibly with replacement characters), but in no normal case do you get such a "string". To get a binary string, just use `'b'` in the mode passed to `open`. – Giacomo Catenazzi Feb 04 '21 at 16:02
  • I read it from a csv file. I can try to add that 'b'. But maybe I can change it later? Like using a function: `bytestring = convert_to_bytes(s)`. No? – Alex Feb 04 '21 at 16:23
  • Note: you should tag people when replying. (You are tagged automatically because you are the asker, and I am notified when someone replies to my answer.) How do you read the CSV file? Usually I use `with open('file.csv', encoding='utf8') as f: for line in f.readlines(): fields = line.split(',')`. But you may be using a module, such as the `csv` module? How do you read the file? [Long ago, in earlier 3.x versions, csv was buggy regarding Unicode files.] – Giacomo Catenazzi Feb 04 '21 at 16:32
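
To illustrate the point about `open` raised in these comments, here is a minimal sketch of reading a CSV file with an explicit encoding. The helper name `read_rows` and the sample file contents are assumptions made for the example, not taken from the question. It also shows that decoding a UTF-8 file as latin-1 produces exactly the kind of string the question starts with.

```python
import csv
import os
import tempfile


def read_rows(path, encoding="utf-8"):
    """Return all rows of a CSV file, decoding it with the given encoding."""
    # newline="" is what the csv module documentation recommends for file objects.
    with open(path, encoding=encoding, newline="") as f:
        return list(csv.reader(f))


# Write a small UTF-8 sample file so the sketch is self-contained.
with tempfile.NamedTemporaryFile(
    "w", encoding="utf-8", newline="", suffix=".csv", delete=False
) as tmp:
    tmp.write("name,city\nAurélien,Paris\n")
    sample_path = tmp.name

rows_utf8 = read_rows(sample_path)               # decoded with the right codec
rows_latin1 = read_rows(sample_path, "latin-1")  # decoded with the wrong codec
os.remove(sample_path)

print(rows_utf8[1][0])
print(repr(rows_latin1[1][0]))
```

If the file really is UTF-8, passing `encoding="utf-8"` when opening it avoids the problem at the source, so no repair step is needed afterwards.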

3 Answers

First find the encoding of the string and then decode it... to do this you will need to make a byte string by adding the letter 'b' to the front of the original string.

Try this:

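import chardet

s = "Aur\xc3\xa9lien"
bs = b"Aur\xc3\xa9lien"

# chardet guesses the encoding from the raw bytes (the guess can be wrong for short inputs)
encoding = chardet.detect(bs)["encoding"]

# Re-encode the str with the detected codec, then decode the resulting bytes as UTF-8.
# This only recovers the text if that codec maps each character back to its original byte, as latin-1 does.
decoded = s.encode(encoding).decode("utf-8")

print(decoded)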

If you are reading the text from a file you can detect the encoding using the magic lib, see here: https://stackoverflow.com/a/16203777/1544937

Jacob Philpott
  • How do I know the original string is encoded in 'latin1'? – Alex Feb 04 '21 at 15:47
  • @Alex I have updated my answer to programmatically detect the encoding. – Jacob Philpott Feb 04 '21 at 16:04
  • @Alex I also added a link to help you detect the encoding if the text is from a file and not a string in code. – Jacob Philpott Feb 04 '21 at 16:17
  • But the problem is not about detecting the encoding (it is clearly UTF-8), and a Python string in theory has no encoding. The problem is that the Python string holds some characters as binary data, not interpreted as Unicode code points (a hidden, not very well-known feature of Python that most programmers should never see). – Giacomo Catenazzi Feb 05 '21 at 10:38

You have UTF-8 decoded as latin-1, so the solution is to encode as latin-1 then decode as UTF-8.

s = "Aur\xc3\xa9lien"
# Encoding as latin-1 recovers the original bytes; decoding those bytes as UTF-8 gives the text.
print(s.encode('latin-1').decode('utf-8'))

Output
Aurélien
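
If the same repair has to be applied to many values, for example every field read from a file, the round trip can be wrapped in a small helper. This is only a sketch: the name `fix_mojibake` and its fallback behaviour (returning the input unchanged when the round trip is impossible) are assumptions for illustration, not part of any library.

```python
def fix_mojibake(text, wrong="latin-1", right="utf-8"):
    """Undo a decode done with `wrong` when the bytes were really `right`.

    Returns the text unchanged if it cannot be round-tripped.
    """
    try:
        return text.encode(wrong).decode(right)
    except (UnicodeEncodeError, UnicodeDecodeError):
        # Either the text has characters outside `wrong`, or the recovered
        # bytes are not valid `right`; assume the text was fine already.
        return text


repaired = fix_mojibake("Aur\xc3\xa9lien")
untouched = fix_mojibake("Aurélien")

print(repaired)
print(untouched)
```

The fallback is a design choice: it keeps already-correct strings intact, at the cost of silently ignoring text that was damaged in some other way.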
mhhabib
  • How do I know it is 'latin-1'? – Alex Feb 04 '21 at 15:30
  • @Alex In `latin1` each character is exactly one byte long. In `utf8` a character can consist of more than one byte, so utf8 can represent many more characters than latin1. If you want to know more about it, you can go through this answer: https://stackoverflow.com/questions/2708958/differences-between-utf8-and-latin1 – mhhabib Feb 04 '21 at 15:41
  • But in the actual text there is one character `é` that is 8 bytes long. No? Sorry I do not understand. `é`=`\xc3\xa9` – Alex Feb 04 '21 at 15:46
  • In UTF-8, `é` is actually 16 bits, or 2 bytes, long. You can see this for yourself by assigning it as `bytes`: `b = b'\xc3\xa9'`. Check that the length is 2 with `len(b)`. Get the decimal value of both bytes: `byte1 = b[0]` is `195` and `byte2 = b[1]` is `169`. Then format them as binary: `print(f'{byte1:b}')` prints `11000011` and `print(f'{byte2:b}')` prints `10101001`. Behind the scenes, `utf-8` reads the binary bits and translates them to the characters they encode, sometimes in chunks of 8 bits, sometimes more. – Axe319 Feb 04 '21 at 16:55
  • This answer explains it a lot better than I can: https://stackoverflow.com/a/27939161/12479639. What it boils down to is that the 8-bit value read from disk indicates whether or not the next 8 bits belong to the same character. – Axe319 Feb 04 '21 at 17:03
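
The idea in the last two comments, that the first byte says how many bytes belong to the character, can be sketched as follows. The function name `utf8_sequence_length` is made up for this example, and the check is deliberately simplified: it only looks at the leading bit pattern and does not reject every byte that strict UTF-8 forbids.

```python
def utf8_sequence_length(lead_byte):
    """Number of bytes in a UTF-8 sequence, judged from its first byte."""
    if lead_byte < 0x80:           # 0xxxxxxx: plain ASCII, one byte
        return 1
    if lead_byte >> 5 == 0b110:    # 110xxxxx: start of a two-byte sequence
        return 2
    if lead_byte >> 4 == 0b1110:   # 1110xxxx: start of a three-byte sequence
        return 3
    if lead_byte >> 3 == 0b11110:  # 11110xxx: start of a four-byte sequence
        return 4
    # 10xxxxxx is a continuation byte and cannot start a character.
    raise ValueError("not a UTF-8 lead byte")


encoded = "é".encode("utf-8")
print(encoded, len(encoded), utf8_sequence_length(encoded[0]))
```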

Your string is not a Unicode sequence but raw bytes, so you should prefix the literal with b:

b = b"Aur\xc3\xa9lien"
b.decode('utf-8')

This gives the expected 'Aurélien'.

If you want to keep using s, encode it with latin-1 first. That codec maps the code points 0-255 one-to-one onto the byte values 0-255, so it recovers the bytes in your string exactly. Other 8-bit codecs such as mbcs (Windows only) or mac_roman are not guaranteed to do that, because they assign some byte values to different characters. Once you have the bytes, you can decode them as in the first part of this answer.
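
It can be worth checking which 8-bit codec really gives a byte-for-byte mapping. The sketch below uses only built-in codecs; the variable names are chosen for the example. It shows that latin-1 round-trips all 256 byte values and recovers the original bytes of `s`, while mac_roman produces different bytes for the same string.

```python
# latin-1 maps byte value n to code point n for every n in 0..255.
all_bytes = bytes(range(256))
as_text = all_bytes.decode("latin-1")
code_points = [ord(ch) for ch in as_text]
round_trip = as_text.encode("latin-1")

s = "Aur\xc3\xa9lien"
via_latin1 = s.encode("latin-1")      # the original UTF-8 bytes
via_mac_roman = s.encode("mac_roman")  # a different byte sequence

print(via_latin1.decode("utf-8"))
print(via_latin1 == via_mac_roman)
```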

Giacomo Catenazzi
  • I read that string from a file. How do I prefix that with a 'b'? – Alex Feb 04 '21 at 16:00
  • If you read the string from a file, you should say so in your question, along with how you read it. It is not normal to get such a string when reading data from a file. Really, it is far from the default or expected behaviour of reading files. – Giacomo Catenazzi Feb 04 '21 at 16:04
  • And in any case, the second part of the answer tells you what to do if you have a string with binary data. – Giacomo Catenazzi Feb 04 '21 at 16:05