Questions tagged [diacritics]

A diacritic is "a mark near or through an orthographic or phonetic character or combination of characters indicating a phonetic value different from that given the unmarked or otherwise marked element" (Merriam-Webster).

From Wikipedia:

A diacritic (/daɪ.əˈkrɪtɨk/; also diacritical mark, diacritical point, diacritical sign) is a glyph added to a letter, or basic glyph. The term derives from the Greek διακριτικός (diakritikós, "distinguishing"). Diacritic is both an adjective and a noun, whereas diacritical is only an adjective. Some diacritical marks, such as the acute ( ´ ) and grave ( ` ) are often called accents. Diacritical marks may appear above or below a letter, or in some other position such as within the letter or between two letters.

The main use of diacritics in the Latin alphabet is to change the sound value of the letter to which they are added. Examples from English are the diaeresis in naïve and Noël, which show that the vowel with the diaeresis mark is pronounced separately from the preceding vowel; the acute and grave accents, which indicate that a final vowel is to be pronounced, as in saké and poetic breathèd, and the cedilla under the "c" in the borrowed French word façade, which shows it is pronounced /s/ rather than /k/. In other Latin alphabets, they may distinguish between homonyms, such as French là "there" versus la "the," which are both pronounced [la]. In Gaelic type, a dot over consonants indicates lenition of the consonant in question. In other alphabetic systems, diacritics may perform other functions. Vowel pointing systems, namely the Arabic harakat ( ـَ, ـُ, ـُ, etc.) and the Hebrew niqqud ( ַ, ֶ, ִ, ֹ , ֻ, etc.) systems, indicate sounds (vowels and tones) that are not conveyed by the basic alphabet. The Indic virama ( ् etc.) and the Arabic sukūn ( ـْـ ) mark the absence of a vowel. Cantillation marks indicate prosody. Other uses include the Early Cyrillic titlo ( ◌҃ ) and the Hebrew gershayim ( ״ ), which, respectively, mark abbreviations or acronyms, and Greek diacritics, which showed that letters of the alphabet were being used as numerals.

In orthography and collation, a letter modified by a diacritic may be treated either as a new, distinct letter or as a letter–diacritic combination. This varies from language to language, and may vary from case to case within a language.

In some cases, letters are used as "in-line diacritics" in place of ancillary glyphs, because they modify the sound of the letter preceding them, as in the case of the "h" in English "sh" and "th".
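
The letter–diacritic combination described above has a direct counterpart in Unicode, where many accented letters can be written either as one precomposed code point or as a base letter followed by a combining mark. The following is a minimal Python sketch of that idea, not a complete solution: `strip_diacritics` is a hypothetical helper name, and it only handles marks that Unicode decomposes canonically (letters such as "ł" or "ø", which have no canonical decomposition, pass through unchanged).

```python
import unicodedata


def strip_diacritics(text: str) -> str:
    """Hypothetical helper: drop combining marks after canonical decomposition."""
    # NFD splits precomposed letters into base letter + combining mark(s).
    decomposed = unicodedata.normalize("NFD", text)
    # unicodedata.combining() returns a non-zero class for combining marks.
    return "".join(ch for ch in decomposed if not unicodedata.combining(ch))


if __name__ == "__main__":
    precomposed = "\u00e9"  # "é" as a single code point
    decomposed = unicodedata.normalize("NFD", precomposed)
    print(len(precomposed), len(decomposed))
    print(strip_diacritics("crème brûlée"))
    print(strip_diacritics("façade naïve"))
```

Whether stripping marks is appropriate depends on the language: where an accented letter counts as a distinct letter, removing the mark changes the word rather than simplifying it.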


1014 questions
585 votes, 10 answers

What is the best way to remove accents (normalize) in a Python unicode string?

I have a Unicode string in Python, and I would like to remove all the accents (diacritics). I found on the web an elegant way to do this (in Java): convert the Unicode string to its long normalized form (with a separate character for letters and…
MiniQuark • 40,659 • 30 • 140 • 167

568 votes, 30 answers

Remove accents/diacritics in a string in JavaScript

How do I remove accented characters from a string? Especially in IE6, I had something like this: accentsTidy = function(s){ var r=s.toLowerCase(); r = r.replace(new RegExp(/\s/g),""); r = r.replace(new RegExp(/[àáâãäå]/g),"a"); r…
glmxndr • 42,138 • 27 • 90 • 115

464 votes, 22 answers

How do I remove diacritics (accents) from a string in .NET?

I'm trying to convert some strings that are in French Canadian and basically, I'd like to be able to take out the French accent marks in the letters while keeping the letter. (E.g. convert é to e, so crème brûlée would become creme brulee) What is…
James Hall • 5,698 • 6 • 24 • 27

282 votes, 12 answers

Is there a way to get rid of accents and convert a whole string to regular letters?

Is there a better way for getting rid of accents and making those letters regular apart from using String.replaceAll() method and replacing letters one by one? Example: Input: orčpžsíáýd Output: orcpzsiayd It doesn't need to include all letters…
Martin • 3,065 • 3 • 19 • 20

189 votes, 22 answers

Microsoft Excel mangles Diacritics in .csv files?

I am programmatically exporting data (using PHP 5.2) into a .csv test file. Example data: Numéro 1 (note the accented e). The data is UTF-8 (no prepended BOM). When I open this file in MS Excel it displays as Numéro 1. I am able to open this in a…
Freddo411 • 2,135 • 3 • 17 • 13

135 votes, 12 answers

Converting Symbols, Accent Letters to English Alphabet

The problem is that, as you know, there are thousands of characters in the Unicode chart and I want to convert all the similar characters to the letters which are in English alphabet. For instance here are a few…
AhmetB - Google • 35,086 • 32 • 117 • 191

89 votes, 12 answers

Remove diacritical marks (ń ǹ ň ñ ṅ ņ ṇ ṋ ṉ ̈ ɲ ƞ ᶇ ɳ ȵ) from Unicode chars

I am looking at an algorithm that can map between characters with diacritics (tilde, circumflex, caret, umlaut, caron) and their "simple" character. For example: ń ǹ ň ñ ṅ ņ ṇ ṋ ṉ ̈ ɲ ƞ ᶇ ɳ ȵ --> n á --> a ä --> a ấ --> a ṏ -->…
flybywire • 232,954 • 184 • 384 • 491

82 votes, 6 answers

Easy way to remove accents from a Unicode string?

I want to change this sentence: Et ça sera sa moitié. To: Et ca sera sa moitie. Is there an easy way to do this in Java, like I would do in Objective-C? NSString *str = @"Et ça sera sa moitié."; NSData *data = [str…
Rob • 14,827 • 20 • 65 • 104

70 votes, 22 answers

Listings in Latex with UTF-8 (or at least german umlauts)

Trying to include a source file in my LaTeX document using the listings package, I ran into problems with German umlauts inside the comments in the code. Using \lstset{ extendedchars=\true, inputencoding=utf8x } Umlauts in the source files (encoded…
Janosch • 1,184 • 1 • 10 • 19

64 votes, 5 answers

Should I use accented characters in URLs?

When one creates web content in languages other than English, the problem of search-engine-optimized and user-friendly URLs emerges. I'm wondering whether it is the best practice to use de-accented letters in URLs -- risking that some words have…
Wabbitseason • 5,330 • 9 • 46 • 57

49 votes, 2 answers

MacOSX: how to disable accented characters input

I'm using Eclipse Juno on MacOSX Lion and have an issue with typing. I often print one quote/apostrophe and move the caret. But in this Mac version of Eclipse the quote as I type is highlighted by orange marker (it seems like Mac smart quotes…
Tertium • 5,477 • 3 • 27 • 49

47 votes, 7 answers

PHP: Replace umlauts with closest 7-bit ASCII equivalent in an UTF-8 string

What I want to do is to remove all accents and umlauts from a string, turning "lärm" into "larm" or "andré" into "andre". What I tried to do was to utf8_decode the string and then use strtr on it, but since my source file is saved as UTF-8 file, I…
BlaM • 26,721 • 31 • 89 • 104

42 votes, 5 answers

Remove accents from String

Is there any way in Android, which (to my knowledge) doesn't have java.text.Normalizer, to remove any accent from a String? E.g. "éàù" becomes "eau". I'd like to avoid parsing the String to check each character if possible!
Johann • 11,420 • 10 • 56 • 82

39 votes, 3 answers

How to protect against diacritics such as Zalgo text

The character pictured above was tweeted a few months ago by Mikko Hyppönen, a computer security expert known for his work on computer viruses and TED talks on computer security. In respect for SO, I will only post an image of it, but you get the…
Derek Hunziker • 12,518 • 3 • 53 • 103

35 votes, 3 answers

Java string searching ignoring accents

I am trying to write a filter function for my application that will take an input string and filter out all objects that don't match the given input in some way. The easiest way to do this would be to use String's contains method, i.e. just check…
DaveJohnston • 9,755 • 10 • 51 • 81