Remove letter accents from a given text

Question

Maybe I'm missing something obvious, but is there a "painless" way to replace the accented letters in a given text with their unaccented counterparts? I can only use the standard ANSI C libraries/headers, so my hands are tied. What I've tried so far:

unsigned char currentChar;

(...)

if (currentChar == 'à') { 
    currentChar = 'a'; 
}
else if (currentChar == 'è' || currentChar == 'é') {
    currentChar = 'e'; 
}
else if (...)

However, this doesn't work. Detecting accented vowels with their extended ASCII value isn't an option, either, as I've noticed that it changes depending upon the system locale.

Any hints/suggestions?

(update)

Thanks for the answers, but I'm not really asking for the best approach for this problem - I'll think about it later. I'm simply asking for a way to detect the accented vowels, as the code above simply ignores them.

(update #2)

Okay. Let me clarify:

#include <stdio.h>

int main(void) {
    int i;
    char vowels[6] = {'à','è','é','ì','ò','ù'};
    for (i = 0; i < 6; i++) {
        switch (vowels[i]) {
            case 'à': vowels[i] = 'a'; break;
            case 'è': vowels[i] = 'e'; break;
            case 'é': vowels[i] = 'e'; break;
            case 'ì': vowels[i] = 'i'; break;
            case 'ò': vowels[i] = 'o'; break;
            case 'ù': vowels[i] = 'u'; break;
        }
    }
    printf("\n");
    for (i = 0; i < 6; i++) {
        printf("%c", vowels[i]);
    }
    printf("\n");
    return 0;
}

This code still prints "àèéìòù" as its output. This is my problem. I appreciate the answers, however it's pointless to tell me to implement a conversion map, or a switch/case structure. I'll think about it later.

While maybe not a solution to your problem, I would at least use `switch` instead of lots of `if...else if` statements. – Some programmer dude, Nov 05 '12 at 18:53
http://stackoverflow.com/questions/12991207/efficently-replacement-of-all-unsupported-chars-in-a-string/12991244#12991244 — SomeWittyUsername, Nov 05 '12 at 18:56
What is ANSI C for you? AFAIK, ANSI is in sync with the ISO standards committee about the revisions of C, with the current revision being C11. Do you mean that or C89? — Jens Gustedt, Nov 05 '12 at 19:05
Nancy, how are you compiling? When I compile your code I get many `warning: multi-character character constant` messages, i.e. your accented characters are not single chars – William Morris, Nov 05 '12 at 19:18
@WilliamMorris: good observation, the source code encoding is probably UTF-8 but the compiler doesn't support that and sees the characters as multi-byte "character" constants (which don't compare equal to characters). — Greg Hewgill, Nov 05 '12 at 19:19
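The point made in the last two comments can be checked without relying on the editor's encoding. This is a minimal sketch, assuming the text is UTF-8, with the bytes of "à" written as explicit hex escapes. The names `A_GRAVE_UTF8`, `A_GRAVE_LATIN1` and `dump_bytes` are made up for illustration.

```c
#include <stdio.h>
#include <string.h>

/* "à" spelled out byte by byte, so the source file encoding is irrelevant. */
static const char A_GRAVE_UTF8[]   = "\xC3\xA0"; /* UTF-8: two bytes     */
static const char A_GRAVE_LATIN1[] = "\xE0";     /* ISO-8859-1: one byte */

/* Print each byte of s in hex and return how many bytes it occupies. */
static size_t dump_bytes(const char *s)
{
    size_t i, n = strlen(s);
    for (i = 0; i < n; i++)
        printf("%02X ", (unsigned)(unsigned char)s[i]);
    printf("\n");
    return n;
}
```

Calling `dump_bytes(A_GRAVE_UTF8)` prints `C3 A0` and returns 2. A single `char` can hold only one of those bytes, which is one likely reason the `case 'à':` labels never match.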

score 3 · Answer 1 · edited May 23 '17 at 12:09

The accented characters are probably stored in UTF-8 or some other non-ASCII encoding. Your program processes them one `char` at a time, and a `char` holds a single byte.

ASCII represents each character with a single byte, and it does not include any accented letters.

Other encodings do include them, but in a multi-byte encoding such as UTF-8 each accented letter takes more than one byte, so byte-by-byte comparisons cannot match it. The usual solution is to use wide characters.

This question may have a more general explanation.

This question may provide a solution for your case.

This code seems to do what you would like:

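#include <stdio.h>
#include <wchar.h>
#include <locale.h>

int main(int argc, char **argv){
    FILE *f;
    wint_t c;   /* wint_t, not wchar_t, so the WEOF comparison is reliable */

    setlocale(LC_CTYPE, "");   /* use the environment's encoding for I/O */
    if (argc < 2)
        return 1;
    f = fopen(argv[1], "r");
    if (!f)
        return 1;

    /* The wide literals below are only correct if the compiler decodes
       the source file's encoding properly. */
    while ((c = fgetwc(f)) != WEOF){
        switch (c) {
            case L'à': c = L'a'; break;
            case L'è': c = L'e'; break;
            case L'é': c = L'e'; break;
            case L'ì': c = L'i'; break;
            case L'ò': c = L'o'; break;
            case L'ù': c = L'u'; break;
            default:   break;
        }
        wprintf(L"%lc", c);
    }

    fclose(f);
    return 0;
}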

@NancyB., I've posted a block of code that seems to do what you would like. — Richard, Nov 05 '12 at 19:48
And just made a quick correction so that it reads `wprintf(L"%c", c);`. I've tested and it works for me. — Richard, Nov 05 '12 at 19:52
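If the compiler misreads accented literals in the source file, the same wide-character mapping can be written with numeric values instead. This is a sketch, not part of the answer above. It assumes `wchar_t` values are Unicode code points, which holds where the implementation defines `__STDC_ISO_10646__` (glibc, for example) but is not guaranteed by the C standard. The function name `strip_accent_w` is hypothetical.

```c
#include <wchar.h>

/* Map a few accented lowercase vowels to their base letters.
 * Assumption: wchar_t values are Unicode code points. */
static wchar_t strip_accent_w(wchar_t c)
{
    switch (c) {
        case 0x00E0: return L'a';  /* a with grave */
        case 0x00E8:               /* e with grave */
        case 0x00E9: return L'e';  /* e with acute */
        case 0x00EC: return L'i';  /* i with grave */
        case 0x00F2: return L'o';  /* o with grave */
        case 0x00F9: return L'u';  /* u with grave */
        default:     return c;     /* anything else is left alone */
    }
}
```

Because the case labels are plain integers, the result no longer depends on how the source file itself is encoded.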

score 1 · Answer 2 · answered Nov 05 '12 at 18:53

There may be an easier way, some existing functionality that I haven't heard of, but as far as structure goes, this is how I'd approach it:

Build a table of character conversions, each entry pairing an accented character with its unaccented replacement. Then write a simple loop that looks up each character of the text in the table and, if it is found, substitutes the replacement.

Jonathan Wood


Okay, but the problem I've encountered is - the accented vowels simply aren't recognized. I know that isalpha() doesn't handle them, so I've implemented some explicit checks, but they don't seem to be able to detect the vowels themselves. – Nancy B. Nov 05 '12 at 18:56
My suggested structure does not require `isalpha()` or any other CRT code. Please read my suggestion again and let me know if there are parts you do not understand. – Jonathan Wood Nov 05 '12 at 19:02
@NancyB.: yeah, `is*()` from `ctype.h` doesn't handle them. You may want to look at the `wchar.h` routines. – Jack Nov 05 '12 at 19:06
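The table-and-loop structure suggested in this answer could be sketched as follows. This is one possible reading, assuming the input is UTF-8, and it covers only the six vowels from the question. The names `accent_map`, `ACCENT_TABLE` and `remove_accents_utf8` are invented for illustration. Each table entry holds the whole byte sequence of an accented letter, so multi-byte characters are matched as a unit.

```c
#include <string.h>

/* One table entry: a UTF-8 byte sequence and its ASCII replacement. */
struct accent_map {
    const char *utf8;   /* accented letter, written as explicit bytes */
    char        plain;  /* unaccented counterpart */
};

static const struct accent_map ACCENT_TABLE[] = {
    { "\xC3\xA0", 'a' },  /* a with grave */
    { "\xC3\xA8", 'e' },  /* e with grave */
    { "\xC3\xA9", 'e' },  /* e with acute */
    { "\xC3\xAC", 'i' },  /* i with grave */
    { "\xC3\xB2", 'o' },  /* o with grave */
    { "\xC3\xB9", 'u' }   /* u with grave */
};

/* Copy src to dst, replacing every table match with its plain letter.
 * dst needs room for strlen(src) + 1 bytes; the output never grows. */
static void remove_accents_utf8(const char *src, char *dst)
{
    const size_t n = sizeof ACCENT_TABLE / sizeof ACCENT_TABLE[0];
    while (*src != '\0') {
        size_t i, len = 0;
        for (i = 0; i < n; i++) {
            len = strlen(ACCENT_TABLE[i].utf8);
            if (strncmp(src, ACCENT_TABLE[i].utf8, len) == 0)
                break;
        }
        if (i < n) {            /* match: emit replacement, skip sequence */
            *dst++ = ACCENT_TABLE[i].plain;
            src += len;
        } else {                /* no match: copy the byte unchanged */
            *dst++ = *src++;
        }
    }
    *dst = '\0';
}
```

A linear scan is enough for a table this small. Supporting more letters only means adding rows, not changing the loop.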

AndersK · Answer 3 · 2012-11-05T19:25:13.033

If you write

if ( currentChar == (unsigned char)('è'))...

your approach should work. Given your constraint of only using the standard C libraries, I don't see how else you could pull it off.

edited Nov 05 '12 at 19:25

answered Nov 05 '12 at 19:13

This might work, but only because the accented chars are composed of two bytes and the cast is telling the compiler to use one of these bytes. So on my system, 'à' is represented by 0xC3 and 0xA0, 'è' is 0xC3, 0xA8 etc. The cast is telling the `if` to look only at the value 0xA8. Also, the solution will likely fail on a machine with the opposite endianness. – William Morris Nov 05 '12 at 19:43
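To make the comment above concrete, here is a small sketch comparing the two checks. It assumes UTF-8 input and, as the comment describes, that the cast leaves only the value 0xA8 (the value of a multi-character constant is implementation-defined, so this is not guaranteed). Both function names are hypothetical.

```c
#include <stddef.h>

/* What the cast-based test effectively does: compare one byte to 0xA8. */
static int matches_cast_check(unsigned char byte)
{
    return byte == 0xA8;
}

/* Compare both bytes of the UTF-8 sequence 0xC3 0xA8 (e with grave).
 * p must point at least two readable bytes. */
static int is_e_grave_utf8(const unsigned char *p)
{
    return p[0] == 0xC3 && p[1] == 0xA8;
}
```

The single-byte check also fires on the second byte of other characters, for instance U+0168 (capital U with tilde), which is encoded as 0xC5 0xA8. Checking the lead byte as well removes that false positive.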

score 1 · Answer 4 · answered Apr 04 '16 at 18:00

Let's try this one:

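#include <string.h>  /* memchr */

/* Works only when each accented letter is a single byte (for example,
   source file and input both in ISO-8859-1). In UTF-8 the two tables
   no longer line up index for index. */
static const char ACCENT_CHARS[]   = "ÁÀÃÂÇáàãâçÉÊéêÍíÑÓÔÕñóôõÚÜúü";
static const char UNACCENT_CHARS[] = "AAAACaaaacEEeeIiNOOOnoooUUuu";

char p_RemoveAccent(char C)
{
    /* sizeof - 1 keeps the terminating '\0' out of the search. */
    const char *p_Char = memchr(ACCENT_CHARS, C, sizeof(ACCENT_CHARS) - 1);

    return (p_Char ? UNACCENT_CHARS[p_Char - ACCENT_CHARS] : C);
}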

Fernando Barboza


Nice but OP clearly stated ".. I'm not really asking for the best approach for this problem - I'll think about it later. I'm simply asking for a way to **detect** the accented vowels .." -- OP's source encoding is hiding them. – Jongware Apr 04 '16 at 18:05
Doesn't work because accented chars are longer than 1 byte so `UNACCENT_CHARS[(p_Char - ACCENT_CHARS)]` will hit the wrong unaccented char – jsallaberry Aug 17 '16 at 17:43
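The parallel-string lookup does line up when every accented letter is exactly one byte. Here is a sketch of that case, assuming the input is ISO-8859-1 (Latin-1). The bytes are written as hex escapes so the source file's own encoding cannot shift the indexes. The identifiers `LATIN1_ACCENTED`, `LATIN1_PLAIN` and `remove_accent_latin1` are made up here.

```c
#include <string.h>

/* Accented letters as ISO-8859-1 byte values, one byte per letter:
 * uppercase A-grave, A-acute, A-circumflex, A-tilde, C-cedilla, E-acute,
 * E-circumflex, I-acute, N-tilde, O-acute, O-circumflex, O-tilde,
 * U-acute, U-diaeresis, then the same letters in lowercase. */
static const char LATIN1_ACCENTED[] =
    "\xC0\xC1\xC2\xC3\xC7\xC9\xCA\xCD\xD1\xD3\xD4\xD5\xDA\xDC"
    "\xE0\xE1\xE2\xE3\xE7\xE9\xEA\xED\xF1\xF3\xF4\xF5\xFA\xFC";
static const char LATIN1_PLAIN[] =
    "AAAACEEINOOOUU"
    "aaaaceeinooouu";

/* Return the unaccented counterpart of a Latin-1 byte, or c itself. */
static char remove_accent_latin1(char c)
{
    /* sizeof - 1 excludes the terminating '\0' from the search. */
    const char *hit = memchr(LATIN1_ACCENTED, (unsigned char)c,
                             sizeof LATIN1_ACCENTED - 1);
    return hit ? LATIN1_PLAIN[hit - LATIN1_ACCENTED] : c;
}
```

For UTF-8 input this per-byte lookup is not applicable, for the reason given in the comment above: one accented letter spans two bytes, so positions in the two strings stop corresponding.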
