I wanted to modify the ICU4J Cyrillic-to-Latin transliteration so that it keeps spaces. The obvious approach is:

@Test
public void test1() {
    String greek
            = "'E\u00E9 \u043c\u0430\u0442\u0435\u043c\u0430\u0442\u0438\u043a\u0430";
    String id1 = "Any-Latin; NFD; [^\\p{Alnum} ] Remove";
    String id2 = "Any-Latin; NFD";
    String latin1 = com.ibm.icu.text.Transliterator.getInstance(id1)
            .transform(greek);
    Assert.assertEquals("Ee matematika", latin1);
}

but this fails (with ICU4J 54.1.1):

junit.framework.ComparisonFailure: expected:<Ee[ ]matematika> but was:<Ee[]matematika> at junit.framework.Assert.assertEquals

Calling replaceAll in Java code with the same regex does work:

@Test
public void test2() {
    String greek
            = "'E\u00E9 \u043c\u0430\u0442\u0435\u043c\u0430\u0442\u0438\u043a\u0430";
    String id1 = "Any-Latin; NFD; [^\\p{Alnum} ] Remove";
    String id2 = "Any-Latin; NFD";
    String latin1 = com.ibm.icu.text.Transliterator.getInstance(id1)
            .transform(greek);
    Assert.assertEquals("Eematematika", latin1); // why not "Ee matematika"?
    String latin2 = com.ibm.icu.text.Transliterator.getInstance(id2)
            .transform(greek).replaceAll("[^\\p{Alnum} ]", "");
    Assert.assertEquals("Ee matematika", latin2);
}

and so does replacing the space in the transliterator ID with \\x20. Is this just a bug in ICU4J or somehow expected?

Alexey Romanov

1 Answer

It is possible that the difference comes from the toString() call on the ReplaceableString that transform() produces:

public String transform(String source) {
    return transliterate(source);
}
...
public final String transliterate(String text) {
    ReplaceableString result = new ReplaceableString(text);
    transliterate(result);
    return result.toString();
}

Try dumping the UTF-16 code units (or code points) of the strings you get and check whether there is any difference.
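One way to do that check is sketched below, using only the standard library. The class `CodePointDump` and its `dump` helper are hypothetical names made up for this illustration; they are not part of ICU4J. The idea is to print every code point as `U+XXXX` so that a missing or unexpected space (U+0020) or a combining mark left over from NFD becomes visible.

```java
import java.util.stream.Collectors;

public class CodePointDump {
    /** Formats each code point of s as U+XXXX, separated by single spaces. */
    static String dump(String s) {
        return s.codePoints()
                .mapToObj(cp -> String.format("U+%04X", cp))
                .collect(Collectors.joining(" "));
    }

    public static void main(String[] args) {
        // Stand-ins for the two transliterator results from the question;
        // in a real check, pass latin1 and latin2 here instead.
        String withSpace = "Ee matematika";
        String withoutSpace = "Eematematika";
        System.out.println(dump(withSpace));
        System.out.println(dump(withoutSpace));
    }
}
```

Comparing the two dumps line by line shows exactly which code points were dropped or kept by each transliterator ID.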

Wiktor Stribiżew
  • `ReplaceableString.toString()` just returns `buf.toString()`. If there was no difference in code points, how could there be difference in output? – Alexey Romanov Mar 19 '15 at 06:52