Rules from accented letters to ascii ones

Question

Is there a rule that helps to find the UTF-8 codes of all the accented letters associated with an ASCII one? For example, can I get the UTF-8 codes of all the accented letters é, è, ... from the UTF-8 code of the letter e?

Here is a showcase in Python 3 using the solution given by Ramchandra Apte in the accepted answer:

import unicodedata

def accented_letters(letter):
    accented_chars = []

    for accent_type in "acute", "double acute", "grave", "double grave":
        # Build the Unicode character name and look it up.
        name = "Latin small letter {} with {}".format(letter, accent_type)
        try:
            accented_chars.append(unicodedata.lookup(name))
        except KeyError:
            # No character with this name exists for this letter.
            pass

    return accented_chars

print(accented_letters("e"))


for kind in ["NFC", "NFKC", "NFD", "NFKD"]:
    print(
        '---',
        kind,
        list(unicodedata.normalize(kind,"é")),
        sep = "\n"
    )

for oneChar in "βεέ.¡¿?ê":
    print(
        '---',
        oneChar,
        unicodedata.name(oneChar),
        # See "Find characters that are similar glyphically in Unicode?"
        unicodedata.normalize('NFD', oneChar).encode('ascii','ignore'),
        sep = "\n"
    )

The corresponding output.

['é', 'è', 'ȅ']
---
NFC
['é']
---
NFKC
['é']
---
NFD
['e', '́']
---
NFKD
['e', '́']
---
β
GREEK SMALL LETTER BETA
b''
---
ε
GREEK SMALL LETTER EPSILON
b''
---
έ
GREEK SMALL LETTER EPSILON WITH TONOS
b''
---
.
FULL STOP
b'.'
---
¡
INVERTED EXCLAMATION MARK
b''
---
¿
INVERTED QUESTION MARK
b''
---
?
QUESTION MARK
b'?'
---
ê
LATIN SMALL LETTER E WITH CIRCUMFLEX
b'e'

Technical information about UTF-8 (reference given by cjc343)

http://tools.ietf.org/html/rfc3629
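The question speaks of "UTF-8 codes", so it may help to separate two things: the Unicode code point of a character and the bytes that UTF-8 uses to encode it. The following is only a minimal sketch; the helper name `describe` is made up for this illustration.

```python
def describe(ch):
    """Return the code point (as U+XXXX) and the UTF-8 bytes (as hex) of ch."""
    code_point = "U+{:04X}".format(ord(ch))
    utf8_hex = ch.encode("utf-8").hex()
    return code_point, utf8_hex

# "\u00e9" is the precomposed letter e with acute accent.
print(describe("e"))       # ('U+0065', '65')
print(describe("\u00e9"))  # ('U+00E9', 'c3a9')
```

The UTF-8 bytes of the accented letter are not derived from the bytes of the plain letter, so any rule linking the two has to work on code points and Unicode character data rather than on the encoding itself.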

I'm sure there are tools available for this, but you haven't said what language you're using. — r3mainer, Oct 25 '13 at 15:44
The problem is that these are not always actually associated; the association is only perceived. They are distinct symbols depending on the human language. Stop thinking in ASCII if you can. — uchuugaka, Oct 25 '13 at 15:48
Indeed, I would like to produce a tool that cleans the names of automatically generated files, for example an MP3 file named after its title, which can contain special characters. I've made a naïve tool which uses a dictionary from non-ASCII to ASCII, but it is not maintainable without a lot of work. — , Oct 25 '13 at 15:52
You might be interested in this answer: http://stackoverflow.com/a/4848748/235698 — Mark Tolonen, Oct 26 '13 at 06:05
Very interesting. This answer http://stackoverflow.com/questions/4846365/find-characters-that-are-similar-glyphically-in-unicode?answertab=votes#tab-top just completes this post. — , Oct 26 '13 at 08:31

score 1 · Answer 1 · answered Oct 25 '13 at 15:51

These are often meant to be distinct characters in many languages. However, if you really need this, you will want a function that normalizes strings. In this case, normalize to a decomposed form, in which each accented letter becomes two Unicode code points in the string: the base letter followed by a combining accent.
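As a small illustration of that decomposition, here is a sketch using Python's standard `unicodedata` module (the language was not specified at this point in the thread, so Python is an assumption, and `decompose` is a name chosen only for this example):

```python
import unicodedata

def decompose(text):
    """Return (code point, character name) pairs for the NFD form of text."""
    decomposed = unicodedata.normalize("NFD", text)
    return [
        ("U+{:04X}".format(ord(c)), unicodedata.name(c, "<unnamed>"))
        for c in decomposed
    ]

# "\u00e9" is the single precomposed code point for e with acute accent.
for pair in decompose("\u00e9"):
    print(pair)
# ('U+0065', 'LATIN SMALL LETTER E')
# ('U+0301', 'COMBINING ACUTE ACCENT')
```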

Ramchandra Apte · Accepted Answer · 2013-10-26T05:58:20.753

0

Using unicodedata.lookup:

import unicodedata

def accented_letters(letter):
    accented_chars = []
    for accent_type in "acute", "double acute", "grave", "double grave":
        name = "Latin small letter {} with {}".format(letter, accent_type)
        try:
            accented_chars.append(unicodedata.lookup(name))
        except KeyError:
            pass
    return accented_chars

print(accented_letters("e"))
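The list of accent names above is fixed by hand, so it misses variants such as the circumflex. One possible generalization, which is not part of the original answer, is to scan all code points and keep those whose canonical decomposition is the given letter followed only by combining marks. This is a rough sketch; `accented_variants` is a hypothetical name, and the scan takes a moment because it visits every code point.

```python
import sys
import unicodedata

def accented_variants(letter):
    """Collect characters whose NFD form is `letter` plus combining marks."""
    variants = []
    for code_point in range(sys.maxunicode + 1):
        # Skip the surrogate range: these are not real characters.
        if 0xD800 <= code_point <= 0xDFFF:
            continue
        ch = chr(code_point)
        decomposed = unicodedata.normalize("NFD", ch)
        if (
            len(decomposed) > 1
            and decomposed[0] == letter
            and all(unicodedata.combining(c) for c in decomposed[1:])
        ):
            variants.append(ch)
    return variants

variants = accented_variants("e")
# Show how many were found and the code points of the first few.
print(len(variants), ["U+{:04X}".format(ord(c)) for c in variants[:5]])
```

The exact result depends on the Unicode version bundled with the Python interpreter, and letters that Unicode treats as independent (with no canonical decomposition) will not be found this way.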

To do the reverse, one can use unicodedata.normalize with the NFD form and take the first character of the result: the base letter comes first, and what follows is the combining accent.

print(unicodedata.normalize("NFD","è")[0]) # Prints "e".
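Taking index `[0]` handles a single character. For a whole string, such as the file names mentioned in the comments on the question, a common variation is to decompose the string and drop every combining mark. The sketch below assumes that this is the desired behavior; `strip_accents` is a name made up for the example.

```python
import unicodedata

def strip_accents(text):
    """Decompose text (NFD) and drop all combining marks."""
    decomposed = unicodedata.normalize("NFD", text)
    return "".join(c for c in decomposed if not unicodedata.combining(c))

# The escapes spell "café à la crème" with precomposed letters.
print(strip_accents("caf\u00e9 \u00e0 la cr\u00e8me"))  # cafe a la creme
```

Characters without a canonical decomposition, for example ø (U+00F8), pass through unchanged, and non-Latin letters lose their marks but stay non-ASCII, so a file name cleaner would still need a fallback for them.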

edited Oct 26 '13 at 05:58

answered Oct 25 '13 at 15:58

Ramchandra Apte


This seems to do the job. Great! I will try to find documentation about how UTF-8 works. – Oct 25 '13 at 16:33
``unicodedata.normalize`` does not remove the accents, but ``unicodedata.name("ê")`` gives ``LATIN SMALL LETTER E WITH CIRCUMFLEX``, from which a little basic parsing can recover the ASCII letter. – Oct 25 '13 at 16:42
@projetmbc I've added the specific code on how to remove the accent using `unicodedata.normalize`. – Ramchandra Apte Oct 26 '13 at 05:58
Thanks. Since I did not check the type of the output of `unicodedata.normalize`, I just assumed that the output is a string. I'm a little "stupid"! Thanks for the tip. – Oct 26 '13 at 08:22
@projetmbc You aren't stupid, actually both of them are strings, but the one returned in my solution is actually a two-character string, composed of e and the combining accent character. – Ramchandra Apte Oct 26 '13 at 08:33
I'm kidding a little. Indeed, I use LaTeX and I have already met this kind of two-character combination in PDFs. – Oct 26 '13 at 08:42
