Replace non-ASCII characters like "ó" with their closest ASCII equivalent "o"

Question

I have a list that looks like this:

name_list=['ramon del rio,georgina genes,jorge lópez']

I want to convert it into a list of byte strings. To do this, I am running the following code:

name_list_bytes = []
for i in name_list:
    name_list_bytes.append(list(map(lambda x: str.encode(x, "UTF-8"), i.split(','))))

print(name_list_bytes)

[b'ramon del rio', b'georgina genes', b'jorge l\xf3pez']

As you can see, the name "jorge lópez" is transformed to "jorge l\xf3pez". How can I avoid this transformation and convert the name correctly?

[EDIT]

I found out that Python's encode function takes an additional errors argument that controls what Python does when the string contains characters that the target encoding cannot represent.

name_list_bytes = []
for i in name_list:
    name_list_bytes.append(list(map(lambda x: str.encode(x, "ascii", "ignore"), i.split(','))))

print(name_list_bytes)

[b'ramon del rio', b'georgina genes', b'jorge lpez']  # removes the non-ASCII character

The "ignore" argument removes the non-ASCII characters, but I want to replace them with their closest ASCII equivalent. I guess the best way, although tedious, is to identify those characters and replace them by hand.
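
For comparison, here is a minimal sketch of what the built-in error handlers of `str.encode` do with the same name. None of them maps "ó" to "o"; they only drop or escape the character. The variable names are illustrative.

```python
name = "jorge lópez"

# Each handler decides what happens to characters that ASCII cannot represent.
handlers = ["ignore", "replace", "xmlcharrefreplace", "backslashreplace"]
results = {h: name.encode("ascii", errors=h) for h in handlers}

for handler, encoded in results.items():
    print(handler, encoded)
```

"ignore" drops the character, "replace" substitutes "?", and the other two write an escape sequence in its place.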

That's just how byte objects are printed. Note the `b` before the opening quotes. — Simon Crane, Jul 04 '20 at 18:44
@SimonCrane Ok so it's correct? But the printed function yields that result? — NikSp, Jul 04 '20 at 18:45
Yes. If you print `len(name_list_bytes[2])` you will see that it has the correct number of characters — Simon Crane, Jul 04 '20 at 18:52
@SimonCrane No that's not true. I am not looking for the length. I am looking to get rid of the "\xf3" character... When I run ```print(name_list_bytes[2])``` I get the same output. — NikSp, Jul 04 '20 at 18:56
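
The point made in the comments can be checked directly. A minimal sketch, assuming Python 3: the `\xf3` shown by `print` is only the escaped display of a single byte, and decoding with the same codec recovers the original name. Note that `\xf3` is what Latin-1 produces for "ó"; UTF-8 encodes the same character as the two bytes `\xc3\xb3`.

```python
name = "jorge lópez"

latin1_bytes = name.encode("latin-1")  # "ó" becomes the single byte 0xF3
utf8_bytes = name.encode("utf-8")      # "ó" becomes the two bytes 0xC3 0xB3

print(latin1_bytes)  # b'jorge l\xf3pez'
print(utf8_bytes)    # b'jorge l\xc3\xb3pez'

# The escape sequence is display only; decoding restores the text unchanged.
print(latin1_bytes.decode("latin-1"))  # jorge lópez
print(utf8_bytes.decode("utf-8"))      # jorge lópez
```

In other words, the bytes are already correct in both cases. Getting a plain "o" instead requires transliteration, which is a different operation from encoding.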

NikSp · Answer 1 · 2020-07-05T09:19:53.487

After looking at this question, I found the unidecode package, which replaces non-ASCII characters with their closest ASCII equivalents. Since this makes my question a duplicate, it has been closed.

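import unidecode  # third-party package: pip install Unidecode

name_list = ['ramon del rio,georgina genes,jorge lópez']
final_list = []

for names in name_list:
    # Transliterate each name to ASCII, then encode it to bytes.
    final_list.append([unidecode.unidecode(name).encode() for name in names.split(',')])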

final_list
[[b'ramon del rio', b'georgina genes', b'jorge lopez']]
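
If installing a third-party package is not an option, a similar result can be sketched with the standard library's `unicodedata` module. This is an alternative to the approach above, not a drop-in replacement: it decomposes accented letters and drops the combining marks, so it handles "ó" but silently discards characters that have no decomposition (for example "ø" or "ß"). The helper name `to_ascii_bytes` is hypothetical.

```python
import unicodedata

def to_ascii_bytes(text):
    """Strip accents via NFKD decomposition, then encode as ASCII bytes."""
    # NFKD splits "ó" into "o" plus a combining acute accent (U+0301).
    decomposed = unicodedata.normalize("NFKD", text)
    # The combining mark is not ASCII, so "ignore" drops it and keeps "o".
    return decomposed.encode("ascii", "ignore")

name_list = ['ramon del rio,georgina genes,jorge lópez']
final_list = [[to_ascii_bytes(n) for n in names.split(',')] for names in name_list]
print(final_list)
```

For the names in the question this should give the same result as unidecode; for wider character coverage, unidecode remains the more complete option.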
