9

I want to match the lowercase of the English "I" (i) with the lowercase of the Turkish "İ" (i). They are the same glyph, but they don't match. When I run System.out.println("İ".toLowerCase()); the character i followed by a dot is printed (this site does not display it properly).

Is there a way to match them (preferably without hard-coding it)? I want the program to match identical glyphs regardless of the language and the Unicode code point. Is this possible?

I've tested normalization with no success.

import java.text.Normalizer;

public class Main {
    public static void main(String... a) {
        String iTurkish = "\u0130"; // "İ", capital I with dot above
        String iEnglish = "I";
        prin(iTurkish);
        prin(iEnglish);
    }

    // Prints the string, its NFD form, its lower case, and the NFD form of the lower case.
    private static void prin(String s) {
        System.out.print(s);
        System.out.print(" -  Normalized : " + Normalizer.normalize(s, Normalizer.Form.NFD));
        System.out.print(" - lower case: " + s.toLowerCase());
        System.out.print(" -  Lower case Normalized : " + Normalizer.normalize(s.toLowerCase(), Normalizer.Form.NFD));
        System.out.println();
    }
}

The result is not displayed properly on this site, but in the first line (iTurkish) the lowercase i is still followed by the combining dot ( ̇).

Purpose and Problem

This will be a multilingual dictionary. I want the program to recognize that "İFEL" starts with "if". To make the comparison case-insensitive, I first convert both texts to lower case. İFEL becomes i(dot)fel, so "if" is not recognized as its prefix.

Deduplicator
WVrock
  • 2
    The two letters are different Unicode code points, so they don't match. – Zelldon Jun 09 '15 at 06:49
  • 1
    You can strip diacritics from a string with [commons-lang](https://commons.apache.org/proper/commons-lang/): org.apache.commons.lang3.StringUtils.stripAccents(String) – agad Jun 09 '15 at 06:50
  • @agad Wouldn't that prevent distinguishing i from ı? I would consider it if there is no other way to do this. – WVrock Jun 09 '15 at 06:52
  • @Zelldon true but they are the same glyph. Isn't the point of normalization matching them? – WVrock Jun 09 '15 at 06:52
  • I'm not sure what you want to achieve. – agad Jun 09 '15 at 06:58
  • @agad I want to write "if" to a JTextArea and the program to select the "İFEL" from a JList. I did the algorithm. It first converts it to lower case to prevent case sensitivity. İFEL becomes i(dot)fel. So the program does not see that "İFEL" starts with "if". – WVrock Jun 09 '15 at 07:02
  • @Zelldon I would like to consider it as a last resort. I want the program to be multi lingual. Coding each letter by hand does not seem plausible. – WVrock Jun 09 '15 at 07:05
  • Is "İFEL" enum value? If yes, you can create the *toString(String)* and *fromString(String)* methods, that would match ASCII representation with proper value. – agad Jun 09 '15 at 07:08
  • @agad It is a string read from a txt file. – WVrock Jun 09 '15 at 07:10
  • With the int values of the chars the if works see: http://ideone.com/ZlUB2r – Zelldon Jun 09 '15 at 07:11
  • What is your code now? What about changing it to *if (StringUtils.stripAccents(value).startsWith(jTextValue)) ....*? – agad Jun 09 '15 at 07:14
  • @Zelldon charAt(0) matches because iTurkish is 2 chars. An i and a dot. In the link the dot is invisible but when I copied it to netbeans, dot is shown. `if (turkNorm.equals(engNorm))` returns false. – WVrock Jun 09 '15 at 07:20
  • Of course it has two bytes 'cause of the unicode. But if you only want to match the i you can simply check the first byte. Or not?! – Zelldon Jun 09 '15 at 07:24
  • @Zelldon "İFEL".toLowerCase() starts with "i", so that check works, but it does not start with "if", and that is the problem. – WVrock Jun 09 '15 at 07:27
  • @WVrock If you strip the diacritics after normalizing, as suggested, the result should start with `"if"`. – dimo414 Jun 09 '15 at 07:28
  • @agad StringUtils doesn't seem to have such a method. – WVrock Jun 09 '15 at 07:35
  • https://commons.apache.org/proper/commons-lang/javadocs/api-3.4/org/apache/commons/lang3/StringUtils.html#stripAccents(java.lang.String) – agad Jun 09 '15 at 07:39
  • @agad I don't have that library. Where am I supposed to get it? – WVrock Jun 09 '15 at 07:46
  • 1
    https://commons.apache.org/proper/commons-lang/download_lang.cgi – agad Jun 09 '15 at 08:13

2 Answers

11

If you print out the hex values of the characters you're seeing, the difference is clear:

İ 0x130 - Normalized : İ 0x49 0x307 - Lower case: i̇ 0x69 0x307 - Lower case Normalized : i̇ 0x69 0x307
I 0x49 - Normalized : I 0x49 - Lower case: i 0x69 - Lower case Normalized : i 0x69

Normalizing the Turkish İ doesn't give you an English I. Instead, it gives you an English I followed by a combining diacritic, 0x307 (combining dot above). This is correct, and it is exactly what the normalization process is expected to do. Normalization is not a "convert to ASCII" operation. As the documentation for Normalizer mentions, the process follows a rigorously defined standard: Unicode Standard Annex #15, Unicode Normalization Forms.
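To reproduce the hex values above yourself, a minimal sketch along these lines should work. The class name CodePointDump and the helper hex are made up for illustration, and Locale.ROOT is passed explicitly (an assumption on my part) so the output does not depend on the machine's default locale:

```java
import java.text.Normalizer;
import java.util.Locale;
import java.util.stream.Collectors;

public class CodePointDump {
    // Hypothetical helper: renders every code point of s as 0x<HEX>, space-separated.
    static String hex(String s) {
        return s.codePoints()
                .mapToObj(cp -> String.format("0x%X", cp))
                .collect(Collectors.joining(" "));
    }

    public static void main(String[] args) {
        String dotted = "\u0130"; // Turkish capital I with dot above
        String nfd = Normalizer.normalize(dotted, Normalizer.Form.NFD);

        System.out.println(hex(dotted));                          // 0x130
        System.out.println(hex(nfd));                             // 0x49 0x307
        System.out.println(hex(dotted.toLowerCase(Locale.ROOT))); // 0x69 0x307
        System.out.println(hex("I".toLowerCase(Locale.ROOT)));    // 0x69
    }
}
```

Printing code points rather than the strings themselves sidesteps the rendering problem mentioned in the question, since the combining dot is easy to miss on screen.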

There are numerous ways to strip diacritics, either before or after normalizing. The right choice depends on the specifics of your use case, but for what you describe I would suggest using Guava's CharMatcher class to strip non-ASCII characters after normalizing, e.g.:

String asciiString = CharMatcher.ascii().retainFrom(normalizedString);

This answer goes into more depth about what \p{InCombiningDiacriticalMarks} does, and why it's not ideal. My CharMatcher solution isn't ideal either (the linked answer offers more robust solutions), but for a quick fix you may find retaining only ASCII characters "good enough". The CharMatcher approach is both closer to "correct" and faster than the Pattern-based approach.
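If you would rather not add Guava, the same idea can be sketched with the JDK alone. This is only an illustration under a few assumptions: the class AsciiFoldingPrefix and the methods fold and startsWithFolded are hypothetical names, the filter on code points below 0x80 stands in for CharMatcher.ascii(), and Locale.ROOT is used so lowercasing does not vary with the default locale:

```java
import java.text.Normalizer;
import java.util.Locale;

public class AsciiFoldingPrefix {
    // Hypothetical helper: decompose (NFD), keep only ASCII code points, then lowercase.
    static String fold(String s) {
        String nfd = Normalizer.normalize(s, Normalizer.Form.NFD);
        StringBuilder sb = new StringBuilder(nfd.length());
        nfd.codePoints()
           .filter(cp -> cp < 0x80) // drops combining marks and every other non-ASCII character
           .forEach(sb::appendCodePoint);
        return sb.toString().toLowerCase(Locale.ROOT);
    }

    // Hypothetical helper: prefix test on the folded forms of both strings.
    static boolean startsWithFolded(String entry, String typed) {
        return fold(entry).startsWith(fold(typed));
    }

    public static void main(String[] args) {
        System.out.println(startsWithFolded("\u0130FEL", "if")); // true
        // Dotless ı (U+0131) has no decomposition, so it is removed entirely.
        System.out.println(fold("\u0131f"));                     // f
    }
}
```

Note the trade-off in the second call: characters with no ASCII base letter, such as the dotless ı, simply vanish, which may or may not be acceptable for a dictionary lookup.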

dimo414
  • 1
    +1. Interesting side effect: `"İ".toLowerCase()` seems to decide it needs to decompose the character. At least here ... – dhke Jun 09 '15 at 07:16
  • Everybody seems to suggest stripping diacritics. I will probably do it this way. I guess matching "ıf" with "İF" is better than not matching "if" with "İF". Though I'm not sure that would actually be the case. – WVrock Jun 09 '15 at 07:30
  • 1
    @WVrock - as you've presented it, the best solution to your problem is to strip the diacritics. It's possible you have additional requirements you haven't told us about which might merit a different solution. But broadly speaking, if you want someone to be able to type English characters and map them to Turkish ones, you're going to have to strip *some* information, and you'll be hard pressed to avoid both false positives and false negatives. Your solution should try to minimize whichever is worse for your use case. – dimo414 Jun 09 '15 at 07:42
  • Even though this is the answer that guided me in the right direction, I prefer the code in Rafiq's link. – WVrock Jun 09 '15 at 18:26
-1

You can use the code below:

import java.text.Normalizer;
import java.util.regex.Pattern;

public class Main {
    // Matches runs of characters from the Combining Diacritical Marks block.
    private static final Pattern DIACRITICS = Pattern.compile("\\p{InCombiningDiacriticalMarks}+");

    public static void main(String... a) {
        String iTurkish = "\u0130"; // "İ"
        String iEnglish = "I";
        prin(iTurkish);
        prin(iEnglish);
    }

    private static void prin(String s) {
        System.out.print(s);
        // Decompose first so that the dot becomes a separate combining mark, then remove it.
        String nfdNormalizedString = Normalizer.normalize(s, Normalizer.Form.NFD);
        String stripped = DIACRITICS.matcher(nfdNormalizedString).replaceAll("");
        System.out.print(" -  Normalized : " + stripped);
        System.out.print(" - lower case: " + s.toLowerCase());
        System.out.print(" -  Lower case Normalized : " + Normalizer.normalize(stripped.toLowerCase(), Normalizer.Form.NFD));
        System.out.println();
    }
}

Or see Converting Symbols, Accent Letters to English Alphabet
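As a possible variation on the pattern above, here is a hedged sketch that uses the general category \p{M} (all combining marks) instead of the single \p{InCombiningDiacriticalMarks} block. This swap is my assumption, not something taken from the linked answer, and the names MarkStripping and stripMarks are hypothetical. Because it removes only marks, the dotless ı survives and stays distinct from i:

```java
import java.text.Normalizer;
import java.util.Locale;
import java.util.regex.Pattern;

public class MarkStripping {
    // \p{M} is the Unicode general category "Mark" (combining characters).
    private static final Pattern MARKS = Pattern.compile("\\p{M}+");

    // Hypothetical helper: decompose, drop combining marks, lowercase locale-independently.
    static String stripMarks(String s) {
        String nfd = Normalizer.normalize(s, Normalizer.Form.NFD);
        return MARKS.matcher(nfd).replaceAll("").toLowerCase(Locale.ROOT);
    }

    public static void main(String[] args) {
        System.out.println(stripMarks("\u0130FEL"));            // ifel
        System.out.println(stripMarks("\u0131f").equals("if")); // false
    }
}
```

With this variant "İFEL" still starts with "if", while a typed "ıf" does not match it, which speaks to the concern raised in the comments about telling i and ı apart.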

Community
Rafiq
  • Not really nice to copy code from Utils class and present here as own. – agad Jun 09 '15 at 07:33
  • Why no vote? I provided the link "http://stackoverflow.com/questions/1008802/converting-symbols-accent-letters-to-english-alphabet". Didn't you see it, "agad"? – Rafiq Jun 09 '15 at 07:41
  • +1 for providing a link to the answer and adapting it to the given code, although it would have been better if you had provided the link first and then made clear that you are using someone else's code. – WVrock Jun 09 '15 at 07:59