585

I have a Unicode string in Python, and I would like to remove all the accents (diacritics).

I found on the web an elegant way to do this (in Java):

  1. convert the Unicode string to its long normalized form (with a separate character for letters and diacritics)
  2. remove all the characters whose Unicode type is "diacritic".

Do I need to install a library such as pyICU, or is this possible with just the Python standard library? And what about Python 3?

Important note: I would like to avoid code with an explicit mapping from accented characters to their non-accented counterpart.
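
For reference, those two steps map almost directly onto the standard library's `unicodedata` module; here is a minimal sketch (Python 3, using `combining()` as a stand-in for the "is this a diacritic" check, so treat it as an illustration rather than a definitive answer):

import unicodedata

def remove_diacritics(s):
    # step 1: decompose letters into base character + combining marks
    decomposed = unicodedata.normalize('NFD', s)
    # step 2: drop the combining marks
    return ''.join(c for c in decomposed if not unicodedata.combining(c))

print(remove_diacritics('Málaga'))  # -> 'Malaga'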

smci
MiniQuark

10 Answers

571

Unidecode is the correct answer for this. It transliterates any Unicode string into the closest possible representation in ASCII text.

Example:

import unidecode

accented_string = u'Málaga'
# accented_string is of type 'unicode' (Python 2) / 'str' (Python 3)
unaccented_string = unidecode.unidecode(accented_string)
# unaccented_string contains 'Malaga' and is of type 'str'
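
Unidecode is on PyPI, so a plain `pip install Unidecode` should be the only setup needed. Note that under Python 2 it expects a `unicode` object rather than a byte string (see the comments below); under Python 3 it simply takes and returns `str`.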
guival
Christian Oudard
  • Yeah, this is a better solution than simply stripping the accents. It provides much more useful transliterations for the languages that have conventions for writing words in ASCII. – Paul McMillan Apr 13 '10 at 21:29
  • 83
    Seems to work well with Chinese, but the transformation of the French name "François" unfortunately gives "FranASSois", which is not very good, compared to the more natural "Francois". – Eric O Lebigot Sep 17 '11 at 14:56
  • 11
    Depends on what you're trying to achieve. For example, I'm doing a search right now, and I don't want to transliterate Greek/Russian/Chinese; I just want to replace "ą/ę/ś/ć" with "a/e/s/c". – kolinko Mar 31 '12 at 18:15
  • 64
    @EOL unidecode works great for strings like "François", if you pass unicode objects to it. It looks like you tried with a plain byte string. – Karl Bartel Apr 30 '12 at 09:38
  • 2
    @EOL It looks like the "C cédille" is now handled properly. So, as far as I tested unidecode, which isn't much, I now consider it gives very good results. – Mathieu Mar 03 '13 at 06:13
  • 28
    Note that unidecode >= 0.04.10 (Dec 2012) is GPL. Use earlier versions or check https://github.com/kmike/text-unidecode if you need a more permissive license and can stand a slightly worse implementation. – Mikhail Korobov Feb 23 '14 at 22:27
  • 2
    Doesn't seem to work with German, e.g. Ö => O, where it should be Oe. – chhantyal Jan 07 '15 at 13:33
  • how to use it with variables? – Liam Nov 29 '15 at 19:12
  • 4
    @chhantyal the Ö => OE mapping is quite German-specific. In Finnish, some words like `ääliö` would be rendered as the completely unrecognizable `aeaelioe`; it is simply more correct to omit the diaeresis than to add the `e`, though the pronunciation of the accented letter is pretty much on par with the German umlaut. – Antti Haapala Aug 20 '16 at 06:07
  • 4
    @EOL You'll be pleased to know that in the latest version of the library, `'François'` is mapped to `'Francois'` as you'd expect. – Mark Amery Sep 15 '16 at 13:44
  • 18
    `unidecode` replaces `°` with `deg`. It does more than just removing accents. – Eric Duminil Apr 28 '17 at 12:02
  • 5
    People need to understand that Unicode character decomposition is a language-specific mapping; it does not work universally, and modules like unidecode are never going to work well while ignoring the locale or language of the input. As to CJK characters, it's a childish assumption that you can take an arbitrary CJK character and 'render' it with ASCII: CJK characters can have multiple readings both in Chinese and Japanese, and the Chinese, Japanese, etc. readings are also going to be different. These modules are a waste of time. – imrek May 14 '17 at 17:00
  • What if I'm reading a string from a file? How do I give it as input to the library? Like u + 'str', but that gives me a NameError: name 'u' is not defined. – Mohsin Jul 30 '18 at 11:14
309

How about this:

import unicodedata

def strip_accents(s):
    return ''.join(c for c in unicodedata.normalize('NFD', s)
                   if unicodedata.category(c) != 'Mn')

This works on Greek letters, too:

>>> strip_accents(u"A \u00c0 \u0394 \u038E")
u'A A \u0394 \u03a5'
>>> 

The character category "Mn" stands for Nonspacing_Mark, which is similar to unicodedata.combining in MiniQuark's answer (I didn't think of unicodedata.combining, but it is probably the better solution, because it's more explicit).

And keep in mind, these manipulations may significantly alter the meaning of the text. Accents, Umlauts etc. are not "decoration".
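
To see what the NFD decomposition actually produces, here is a quick inspection sketch (Python 3; illustrative only): the combining acute accent comes out as its own code point, with category `Mn` and a nonzero combining class.

import unicodedata

for c in unicodedata.normalize('NFD', 'é'):
    print(repr(c), unicodedata.name(c),
          unicodedata.category(c), unicodedata.combining(c))
# 'e' LATIN SMALL LETTER E  Ll 0
# '́' COMBINING ACUTE ACCENT  Mn 230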

BartoszKP
oefe
  • 6
    These are not composed characters, unfortunately--even though "ł" is named "LATIN SMALL LETTER L WITH STROKE"! You'll either need to play games with parsing `unicodedata.name`, or break down and use a look-alike table-- which you'd need for Greek letters anyway (Α is just "GREEK CAPITAL LETTER ALPHA"). – alexis Apr 07 '12 at 11:25
  • 2
    @andi, I'm afraid I can't guess what point you want to make. The email exchange reflects what I wrote above: Because the letter "ł" is not an accented letter (and is not treated as one in the Unicode standard), it does not have a decomposition. – alexis Nov 23 '14 at 00:12
  • 2
    @alexis (late follow-up): This works perfectly well for Greek as well – eg. "GREEK CAPITAL LETTER ALPHA WITH DASIA AND VARIA" is normalised into "GREEK CAPITAL LETTER ALPHA" just as expected. Unless you are referring to *transliteration* (eg. "α" → "a"), which is not the same as "removing accents"... – lenz May 16 '16 at 07:41
  • @lenz, I wasn't talking about removing accents from Greek, but about the "stroke" on the ell. Since it is not a diacritic, changing it to a plain ell is the same as changing Greek Alpha to `A`. If you don't want it, don't do it, but in both cases you're substituting a Latin (near) look-alike. – alexis May 16 '16 at 17:01
  • Mostly works nicely :) But it doesn't transform `ß` into ASCII `ss`, for example. I would still use `unidecode` to avoid accidents. – Art Mar 01 '17 at 06:53
  • Should probably use `.combining()` to check the property directly, rather than only handling `.category() == 'Mn'`, which will mess up in some cases. – o11c May 05 '17 at 21:46
  • 1
    + for not requiring installing anything – Brambor Nov 24 '20 at 02:05
  • Your solution is excellent! It works even for old church slavonic letters! :) Thank you! – Alexander Perechnev Mar 27 '21 at 20:24
171

I just found this answer on the Web:

import unicodedata

def remove_accents(input_str):
    nfkd_form = unicodedata.normalize('NFKD', input_str)
    only_ascii = nfkd_form.encode('ASCII', 'ignore')
    return only_ascii

It works fine (for French, for example), but I think the second step (removing the accents) could be handled better than dropping the non-ASCII characters, because that will fail for some languages (Greek, for example). The best solution would probably be to explicitly remove the Unicode characters that are tagged as being diacritics.
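
A hedged illustration of that failure mode (Python 3, where the `encode` step returns `bytes`): the `'ignore'` handler drops every non-ASCII code point, not just the combining marks, so Greek text vanishes entirely.

>>> remove_accents('Δοκιμή')   # Greek for "test"
b''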

Edit: this does the trick:

import unicodedata

def remove_accents(input_str):
    nfkd_form = unicodedata.normalize('NFKD', input_str)
    return u"".join([c for c in nfkd_form if not unicodedata.combining(c)])

unicodedata.combining(c) returns the character's canonical combining class, which is nonzero if c can be combined with the preceding character, that is, mainly if it's a diacritic.
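
A quick check of that behaviour (illustrative, Python 3): the decomposed form of 'é' yields the base letter (class 0) followed by the combining acute accent (class 230).

>>> import unicodedata
>>> [unicodedata.combining(c) for c in unicodedata.normalize('NFKD', 'é')]
[0, 230]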

Edit 2: remove_accents expects a unicode string, not a byte string. If you have a byte string, then you must decode it into a unicode string like this:

encoding = "utf-8"  # or "iso-8859-15", or "cp1252", or whatever encoding you use
byte_string = b"caf\xc3\xa9"  # the UTF-8 bytes of "café" (a bare b"café" is a SyntaxError in Python 3)
unicode_string = byte_string.decode(encoding)
MiniQuark
  • 6
    I had to add 'utf8' to unicode: `nkfd_form = unicodedata.normalize('NFKD', unicode(input_str, 'utf8'))` – Jabba Jan 08 '12 at 23:27
  • @Jabba: `, 'utf8'` is a "safety net" needed if you are testing input in terminal (which by default does not use unicode). But usually you don't *have* to add it, since if you're removing accents then `input_str` is very likely to be utf8 already. It doesn't hurt to be safe, though. – MestreLion Apr 17 '12 at 23:15
  • With the `unicode(input_str)` variant of the function, `remove_accents('é')` raises `UnicodeDecodeError: 'ascii' codec can't decode byte 0xc3 in position 0: ordinal not in range(128)`. – rbp Jun 09 '13 at 15:40
  • 1
    @rbp: you should pass a unicode string to `remove_accents` instead of a regular string (u"é" instead of "é"). You passed a regular string to `remove_accents`, so when trying to convert your string to a unicode string, the default `ascii` encoding was used. This encoding does not support any byte whose value is >127. When you typed "é" in your shell, your O.S. encoded that, probably with UTF-8 or some Windows Code Page encoding, and that included bytes >127. I'll change my function in order to remove the conversion to unicode: it will bomb more clearly if a non-unicode string is passed. – MiniQuark Jun 11 '13 at 10:11
  • 1
    @MiniQuark that worked perfectly >>> remove_accents(unicode('é')) – rbp Jun 12 '13 at 20:59
  • 1
    This answer gave me the best result on a large data set, the only exception is "ð"- unicodedata wouldn't touch it! – s29 Jun 08 '18 at 02:38
  • The first example removes "ł" ("LATIN SMALL LETTER L WITH STROKE") completely :( – mirek Nov 08 '19 at 12:42
  • In Python 3, the first version of `remove_accents` in this post returns a `bytes`. To return a `str`, you need to call `nfkd_form.encode('ASCII', 'ignore').decode('utf8')` – robertspierre Nov 28 '20 at 17:01
  • Doesn't work on the `đ` character; it's supposed to become `d`. – TomSawyer Jan 19 '21 at 07:56
  • Works well, but note that the first function doesn't work for ß. – luky May 18 '21 at 20:22
  • Also note that the first function returns bytes like b"xxx", not a string like "xxx"; you have to convert it to a string first, e.g. str(remove_accents(input_str), 'utf8') https://stackabuse.com/convert-bytes-to-string-in-python/ – luky May 18 '21 at 21:26
50

I work on a project that has to be compatible with Python 2.6, 2.7 and 3.4, and I have to create IDs from free-form user entries.

Thanks to you, I have created this function that works wonders.

import re
import unicodedata

def strip_accents(text):
    """
    Strip accents from input String.

    :param text: The input string.
    :type text: String.

    :returns: The processed String.
    :rtype: String.
    """
    try:
        text = unicode(text, 'utf-8')
    except (TypeError, NameError):  # name 'unicode' is not defined in Python 3, where str is already Unicode
        pass
    text = unicodedata.normalize('NFD', text)
    text = text.encode('ascii', 'ignore')
    text = text.decode("utf-8")
    return str(text)

def text_to_id(text):
    """
    Convert input text to id.

    :param text: The input string.
    :type text: String.

    :returns: The processed String.
    :rtype: String.
    """
    text = strip_accents(text.lower())
    text = re.sub('[ ]+', '_', text)
    text = re.sub('[^0-9a-zA-Z_-]', '', text)
    return text

result:

text_to_id("Montréal, über, 12.89, Mère, Françoise, noël, 889")
>>> 'montreal_uber_1289_mere_francoise_noel_889'
hexaJer
  • 675
  • 6
  • 11
  • 3
    With Py2.7, passing an already-unicode string errors at `text = unicode(text, 'utf-8')`. A workaround for that was to add `except TypeError: pass`. – Daniel Reis Mar 18 '16 at 15:56
28

This handles not only accents, but also "strokes" (as in ø etc.):

import unicodedata as ud

def rmdiacritics(char):
    '''
    Return the base character of char, by "removing" any
    diacritics like accents or curls and strokes and the like.
    '''
    desc = ud.name(char)
    cutoff = desc.find(' WITH ')
    if cutoff != -1:
        desc = desc[:cutoff]
        try:
            char = ud.lookup(desc)
        except KeyError:
            pass  # removing "WITH ..." produced an invalid name
    return char
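
A hedged usage sketch, wrapping the per-character function in a hypothetical `strip_diacritics` helper (the helper name and sample string are illustrative, not part of the original answer):

def strip_diacritics(s):
    # apply rmdiacritics to every character in turn
    return ''.join(rmdiacritics(c) for c in s)

print(strip_diacritics('łódź and øre'))  # -> 'lodz and ore'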

This is the most elegant way I can think of (and it has been mentioned by alexis in a comment on this page), although I don't think it is actually very elegant. In fact, it's more of a hack, as pointed out in the comments, since Unicode names are really just names; they make no guarantee of being consistent or anything.

There are still special letters that are not handled by this, such as turned and inverted letters, since their Unicode name does not contain 'WITH'. It depends on what you want to do anyway. I sometimes needed accent stripping to achieve dictionary sort order.

EDIT NOTE:

Incorporated suggestions from the comments (handling lookup errors, Python-3 code).

lenz
  • 8
    You should catch the exception if the new symbol doesn't exist. For example there's SQUARE WITH VERTICAL FILL ▥, but there's no SQUARE. (not to mention that this code transforms UMBRELLA WITH RAIN DROPS ☔ into UMBRELLA ☂). – janek37 Jul 09 '15 at 09:45
  • This looks elegant in harnessing the semantic descriptions of characters that are available. Do we really need the `unicode` function call in there with python 3 though? I think a tighter regex in place of the `find` would avoid all the trouble mentioned in the comment above, and also, memoization would help performance when it's a critical code path. – matanster Dec 29 '18 at 14:30
  • 1
    @matanster no, this is an old answer from the Python 2 era; the `unicode` typecast is no longer appropriate in Python 3. In any case, in my experience there is no universal, elegant solution to this problem. Depending on the application, any approach has its pros and cons. Tools that strive for quality, like `unidecode`, are based on hand-crafted tables. Some resources (tables, algorithms) are provided by Unicode, e.g. for collation. – lenz Dec 29 '18 at 14:45
  • 1
    I just repeat what is said above (Py3): 1) `unicode(char)` → `char`; 2) `try: return ud.lookup(desc) except KeyError: return char` – mirek Nov 08 '19 at 12:50
  • @mirek you are right: since this thread is so popular, this answer deserves some updating/improving. I edited it. – lenz Nov 08 '19 at 18:22
16

gensim.utils.deaccent(text) from Gensim - topic modelling for humans:

>>> from gensim.utils import deaccent
>>> deaccent("Šéf chomutovských komunistů dostal poštou bílý prášek")
'Sef chomutovskych komunistu dostal postou bily prasek'

Another solution is unidecode.

Note that the suggested solution with unicodedata typically removes accents only in some characters (e.g. it turns 'ł' into '', rather than into 'l').

Piotr Migdal
15

In response to @MiniQuark's answer:

I was trying to read in a csv file that was half-French (containing accents) and also some strings which would eventually become integers and floats. As a test, I created a test.txt file that looked like this:

Montréal, über, 12.89, Mère, Françoise, noël, 889

I had to include lines 2 and 3 below (the `reload(sys)` / `setdefaultencoding` dance, which I found in a Python ticket), as well as incorporate @Jabba's comment:

import sys 
reload(sys) 
sys.setdefaultencoding("utf-8")
import csv
import unicodedata

def remove_accents(input_str):
    nkfd_form = unicodedata.normalize('NFKD', unicode(input_str))
    return u"".join([c for c in nkfd_form if not unicodedata.combining(c)])

with open('test.txt') as f:
    read = csv.reader(f)
    for row in read:
        for element in row:
            print remove_accents(element)

The result:

Montreal
uber
12.89
Mere
Francoise
noel
889

(Note: I am on Mac OS X 10.8.4 and using Python 2.7.3)
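
As a hedged aside (see also the comment below about `setdefaultencoding` being a dubious hack): under Python 3 none of the `sys` gymnastics is needed, since `open()` decodes for you and `str` is already Unicode. A sketch assuming the file is UTF-8:

import csv
import unicodedata

def remove_accents(input_str):
    nfkd_form = unicodedata.normalize('NFKD', input_str)
    return ''.join(c for c in nfkd_form if not unicodedata.combining(c))

with open('test.txt', encoding='utf-8', newline='') as f:
    for row in csv.reader(f):
        for element in row:
            print(remove_accents(element))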

aseagram
  • 1
    `remove_accents` was meant to remove accents from a unicode string. In case it's passed a byte-string, it tries to convert it to a unicode string with `unicode(input_str)`. This uses python's default encoding, which is "ascii". Since your file is encoded with UTF-8, this would fail. Lines 2 and 3 change python's default encoding to UTF-8, so then it works, as you found out. Another option is to pass `remove_accents` a unicode string: remove lines 2 and 3, and on the last line replace `element` by `element.decode("utf-8")`. I tested: it works. I'll update my answer to make this clearer. – MiniQuark Jun 12 '13 at 19:52
  • Nice edit, good point. (On another note: The real problem I've realised is that my data file is apparently encoded in `iso-8859-1`, which I can't get to work with this function, unfortunately!) – aseagram Jun 12 '13 at 20:11
  • aseagram: simply replace "utf-8" with "iso-8859-1", and it should work. If you're on windows, then you should probably use "cp1252" instead. – MiniQuark Jun 13 '13 at 07:43
  • BTW, `reload(sys); sys.setdefaultencoding("utf-8")` is a dubious hack sometimes recommended for Windows systems; see https://stackoverflow.com/questions/28657010/dangers-of-sys-setdefaultencodingutf-8 for details. – PM 2Ring May 16 '18 at 13:13
6

[perfplot benchmark graph comparing the three approaches below]

import unicodedata
from random import choice

import perfplot
import regex
import text_unidecode


def remove_accent_chars_regex(x: str):
    return regex.sub(r'\p{Mn}', '', unicodedata.normalize('NFKD', x))


def remove_accent_chars_join(x: str):
    # answer by MiniQuark
    # https://stackoverflow.com/a/517974/7966259
    return u"".join([c for c in unicodedata.normalize('NFKD', x) if not unicodedata.combining(c)])


perfplot.show(
    setup=lambda n: ''.join([choice('Málaga François Phút Hơn 中文') for i in range(n)]),
    kernels=[
        remove_accent_chars_regex,
        remove_accent_chars_join,
        text_unidecode.unidecode,
    ],
    labels=['regex', 'join', 'unidecode'],
    n_range=[2 ** k for k in range(22)],
    equality_check=None, relative_to=0, xlabel='str len'
)
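
For what it's worth, running this sketch presumably requires `pip install perfplot regex text-unidecode`; `equality_check=None` is set because the three kernels intentionally disagree on the CJK part of the test string (only `unidecode` transliterates it).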
mo-han
  • 1
    Haha... amazing. All these bits and pieces did actually install. The script did actually run. The graph actually displayed. And it is very similar to yours. `unidecode` actually handles the Chinese characters. And none of the three comes up with the hilarious "FranASSois". – mike rodent Feb 07 '21 at 13:39
4

Some languages have combining diacritics as letters of the language, as well as accent diacritics that merely mark stress.

I think it is safer to specify explicitly which diacritics you want to strip:

import unicodedata

def strip_accents(string, accents=('COMBINING ACUTE ACCENT', 'COMBINING GRAVE ACCENT', 'COMBINING TILDE')):
    accents = set(map(unicodedata.lookup, accents))
    chars = [c for c in unicodedata.normalize('NFD', string) if c not in accents]
    return unicodedata.normalize('NFC', ''.join(chars))
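
A quick usage sketch of the whitelist behaviour (assuming the function above, plus the `import unicodedata` added to it; treat the outputs as illustrative): listed marks are stripped, while anything else survives the NFD/NFC round trip untouched.

>>> strip_accents('Málaga')   # acute accent is in the whitelist
'Malaga'
>>> strip_accents('ääliö')    # diaeresis is not listed, so it is kept
'ääliö'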
sirex
1

If you are hoping to get functionality similar to Elasticsearch's asciifolding filter, you might want to consider fold-to-ascii, which is [itself]...

A Python port of the Apache Lucene ASCII Folding Filter that converts alphabetic, numeric, and symbolic Unicode characters which are not in the first 127 ASCII characters (the "Basic Latin" Unicode block) into ASCII equivalents, if they exist.

Here's an example from the page mentioned above:

from fold_to_ascii import fold
s = u'Astroturf® paté'
fold(s)
# -> u'Astroturf pate'
fold(s, u'?')
# -> u'Astroturf? pate'

EDIT: The fold_to_ascii module seems to work well for normalizing Latin-based alphabets; however, unmappable characters are removed, which means that this module will reduce Chinese text, for example, to empty strings. If you want to preserve Chinese, Japanese, and other Unicode alphabets, consider using @mo-han's remove_accent_chars_regex implementation, above.

Eric McLachlan