Efficient replacement of all unsupported chars in a String

Question

Possible Duplicate:
Converting Symbols, Accent Letters to English Alphabet

I need to replace all accented characters, such as

"à", "é", "ì", "ò", "ù"

with

"a'", "e'", "i'", "o'", "u'"...

because of an issue with reloading nested strings with accented characters after they've been saved.

Is there a way to do this without a separate string replacement for every character?

For example, I would prefer to avoid doing

text  = text.replace("à", "a'");
text2 = text.replace("è", "e'");
text3 = text2.replace("ì", "i'");
text4 = text3.replace("ò", "o'");
text5 = text4.replace("ù", "u'");

etc.

If solving the "issue with reloading nested strings with accented characters" is beyond your capabilities or time restraints, maybe an easier way to avoid this character set/encoding problem would be to store the strings base64 encoded. You could use http://commons.apache.org/codec/apidocs/org/apache/commons/codec/binary/Base64.html — goat, Oct 20 '12 at 18:22
There looks to be an API, Normalizer.normalize, to do just such a thing. Check this [post](http://stackoverflow.com/questions/1008802/converting-symbols-accent-letters-to-english-alphabet) — nandeesh, Oct 20 '12 at 18:26
@nandeesh: I think the OP is trying to preserve the accented characters. — Bhesh Gurung, Oct 20 '12 at 18:27
@BheshGurung Normalize api does convert to english letters from what i understand, i have never tried it though — nandeesh, Oct 20 '12 at 18:29
@rambocoder Basically I need to use this workaround because I haven't found a valid solution to this http://stackoverflow.com/questions/12990041/fileoutputstream-cause-string-char-issue — AndreaF, Oct 20 '12 at 18:30
@nandeesh: Yes I think that's what it does but the problem would be that it can be written to the file after normalization, and read back, but then there won't be enough information to get those accented characters back. May be that's what the problem is. — Bhesh Gurung, Oct 20 '12 at 18:33
@BheshGurung i had misunderstood the problem, but now that i have read it correctly, i have posted an answer, that seems to work — nandeesh, Oct 20 '12 at 18:35
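
For what it's worth, the Base64 idea from goat's comment could look roughly like the sketch below. It uses java.util.Base64 from the JDK (Java 8 or later) instead of the commons-codec class linked in that comment, and the names Base64Storage, encode and decode are made up for illustration. The point is only that the stored form is plain ASCII, so it should survive a save/reload step that mangles accented characters.

```java
import java.nio.charset.StandardCharsets;
import java.util.Base64;

public class Base64Storage {
    // Hypothetical helper: turn the text into an ASCII-only Base64 string before saving it.
    static String encode(String text) {
        return Base64.getEncoder().encodeToString(text.getBytes(StandardCharsets.UTF_8));
    }

    // Hypothetical helper: recover the original text, accents included, after loading it.
    static String decode(String stored) {
        return new String(Base64.getDecoder().decode(stored), StandardCharsets.UTF_8);
    }

    public static void main(String[] args) {
        String original = "\u00E0\u00E9\u00EC\u00F2\u00F9"; // the five accented letters from the question
        String stored = encode(original);
        System.out.println(stored);                           // ASCII-only representation
        System.out.println(original.equals(decode(stored)));  // round trip keeps the accents
    }
}
```

Unlike the a' style replacement, this is fully reversible, but the stored text is no longer human-readable.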

Guido Simone · Answer 1 · 2012-10-20T18:59:10.487

If you don't mind adding commons-lang as a dependency, try StringUtils.replaceEach. I believe the following performs the same task:

import org.apache.commons.lang.StringUtils;

public class ReplaceEachTest
{
   public static void main(String [] args)
   {
      String text = "àéìòùàéìòù";
      String [] searchList = {"à", "é", "ì", "ò", "ù"};
      String [] replaceList = {"a'", "e'", "i'", "o'", "u'"};
      String newtext = StringUtils.replaceEach(text, searchList, replaceList);
      System.out.println(newtext);
   }
}

This example prints a'e'i'o'u'a'e'i'o'u'. However, in general I agree that, since you are creating a custom character translation, you will need a solution where you explicitly specify the replacement for each character of interest.
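
If adding commons-lang is not an option, a similar explicit mapping can be sketched with only the JDK. To be clear, this is a swapped-in technique (a lookup table plus a StringBuilder), not how replaceEach is implemented, and the names AccentTable and replaceAccents are hypothetical. The accented letters are written as Unicode escapes so the source file's own encoding cannot get in the way.

```java
import java.util.HashMap;
import java.util.Map;

public class AccentTable {
    // Explicit one-to-many mapping: each accented char becomes a two-char string.
    private static final Map<Character, String> REPLACEMENTS = new HashMap<>();
    static {
        REPLACEMENTS.put('\u00E0', "a'"); // a with grave
        REPLACEMENTS.put('\u00E9', "e'"); // e with acute
        REPLACEMENTS.put('\u00EC', "i'"); // i with grave
        REPLACEMENTS.put('\u00F2', "o'"); // o with grave
        REPLACEMENTS.put('\u00F9', "u'"); // u with grave
    }

    // Single pass over the input; characters without a mapping are copied unchanged.
    static String replaceAccents(String text) {
        StringBuilder sb = new StringBuilder(text.length() + 8);
        for (int i = 0; i < text.length(); i++) {
            char ch = text.charAt(i);
            String mapped = REPLACEMENTS.get(ch);
            if (mapped != null) {
                sb.append(mapped);
            } else {
                sb.append(ch);
            }
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        // Same input as the replaceEach example above.
        String text = "\u00E0\u00E9\u00EC\u00F2\u00F9\u00E0\u00E9\u00EC\u00F2\u00F9";
        System.out.println(replaceAccents(text)); // prints a'e'i'o'u'a'e'i'o'u'
    }
}
```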

My previous answer using replaceChars is no good because it only handles one-to-one character replacement.

Doh! You are correct. Thanks. Will update my answer. – Guido Simone Oct 20 '12 at 18:53

score 4 · Accepted Answer · edited May 23 '17 at 12:18

I tried this, taken from this post, and it seems to work.

// Requires: import java.text.Normalizer; str is the String to convert.
str = Normalizer.normalize(str, Normalizer.Form.NFD);
str = str.replaceAll("\\p{InCombiningDiacriticalMarks}+", "'");

Edit: Replacing the combining diacritical marks has a side effect, though: you can no longer distinguish between À, Á and Â.
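
A small self-contained sketch of this approach, including the side effect just mentioned. The class and method names (NormalizerDemo, markAccents) are made up for illustration, and the inputs are written as Unicode escapes to keep the source file ASCII-only. The last call uses a two-character Chinese string, since that case comes up in the comments; as far as I can tell it passes through unchanged, but other scripts may behave differently under NFD.

```java
import java.text.Normalizer;

public class NormalizerDemo {
    // NFD splits each accented letter into a base letter plus combining marks;
    // every run of combining marks is then replaced by a single apostrophe.
    static String markAccents(String input) {
        String decomposed = Normalizer.normalize(input, Normalizer.Form.NFD);
        return decomposed.replaceAll("\\p{InCombiningDiacriticalMarks}+", "'");
    }

    public static void main(String[] args) {
        // The five accented letters from the question.
        System.out.println(markAccents("\u00E0\u00E9\u00EC\u00F2\u00F9")); // prints a'e'i'o'u'

        // A with grave, acute and circumflex: all three collapse to the same output.
        System.out.println(markAccents("\u00C0\u00C1\u00C2")); // prints A'A'A'

        // Two Chinese characters, which have no decomposition and no combining marks.
        System.out.println(markAccents("\u4E2D\u6587").equals("\u4E2D\u6587")); // prints true
    }
}
```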

answered Oct 20 '12 at 18:32

nandeesh

but replace the accented chars à with a' or simply the accented chars à becomes a ?? – AndreaF Oct 20 '12 at 18:35
the first answer replaced à with a, I had misunderstood the problem, but i edited it , i changed the replaceall line. do check it – nandeesh Oct 20 '12 at 18:37
+1 I like this approach. The regular expression engine is also generally quite efficient at this sort of operation (no backtracking hullaballoo). – Oct 20 '12 at 18:39
however what happen If I use this with a Chinese input? – AndreaF Oct 20 '12 at 18:56
i dont think it does any transformation. Normalize only changes the accented alphabets – nandeesh Oct 20 '12 at 18:57
+1. This is a lot better than trying do it manually. – Bhesh Gurung Oct 20 '12 at 20:16

score 3 · Answer 3 · 2012-10-20T18:44:23.137

After reading the comments on the main post, I think a better option would be to fix the underlying problem (which is encoding-related?) rather than try to cover up the symptoms.

Also, the approach below still requires a manual explicit mapping, which might make it less ideal than nandeesh's answer with a regular expression Unicode character class.

Here is a skeleton of code to perform the mapping. It is slightly more complicated than a char-to-char mapping, because each accented character maps to a two-character string.

This code tries to avoid creating extra Strings. It may or may not be "more efficient". Try it with the real data/usage. YMMV.

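// Returns the replacement for an accented char, or null if the char has no mapping.
String mapAccentChar (char ch) {
    switch (ch) {
        case 'à': return "a'";
        // etc
    }
    return null;
}

// Builds the result in a single pass over the input.
String mapAccents (String input) {
  StringBuilder sb = new StringBuilder();
  int l = input.length();
  for (int i = 0; i < l; i++) {
    char ch = input.charAt(i);
    String mapped = mapAccentChar(ch);
    if (mapped != null) {
      sb.append(mapped);
    } else {
      // No mapping defined: keep the original character.
      sb.append(ch);
    }
  }
  return sb.toString();
}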
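
On the first point (fixing the problem instead of the symptoms): I don't know the actual cause in the linked question, but if it is a charset mismatch between writing and reading, naming the charset explicitly on both sides is the usual cure. A minimal sketch under that assumption follows; the class and method names (Utf8RoundTrip, save, load, roundTrip) and the temp file are purely illustrative.

```java
import java.io.File;
import java.io.FileInputStream;
import java.io.FileOutputStream;
import java.io.IOException;
import java.io.InputStreamReader;
import java.io.OutputStreamWriter;
import java.io.Reader;
import java.io.UncheckedIOException;
import java.io.Writer;
import java.nio.charset.StandardCharsets;

public class Utf8RoundTrip {
    // Write the text as UTF-8 instead of relying on the platform default charset.
    static void save(File file, String text) throws IOException {
        try (Writer out = new OutputStreamWriter(new FileOutputStream(file), StandardCharsets.UTF_8)) {
            out.write(text);
        }
    }

    // Read the file back with the same charset that was used for writing.
    static String load(File file) throws IOException {
        StringBuilder sb = new StringBuilder();
        try (Reader in = new InputStreamReader(new FileInputStream(file), StandardCharsets.UTF_8)) {
            char[] buf = new char[1024];
            int n;
            while ((n = in.read(buf)) != -1) {
                sb.append(buf, 0, n);
            }
        }
        return sb.toString();
    }

    // Save to a temporary file, load it again, and clean up afterwards.
    static String roundTrip(String text) {
        try {
            File file = File.createTempFile("accents", ".txt");
            try {
                save(file, text);
                return load(file);
            } finally {
                file.delete();
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        String original = "\u00E0\u00E9\u00EC\u00F2\u00F9";
        System.out.println(original.equals(roundTrip(original))); // prints true
    }
}
```

If the accents survive a round trip like this, no replacement table is needed at all.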

score 2 · Answer 4 · answered Oct 20 '12 at 18:06

Since there is no strict correlation between the character code of a letter and that of its accented version, your replacement approach seems to me the most straightforward way.

SomeWittyUsername
