I wanted to modify ICU4J's Cyrillic-to-Latin transliteration so that it keeps spaces. The obvious approach is:
    @Test
    public void test1() {
        // Mixed Latin and Cyrillic input; the Cyrillic word transliterates to "matematika".
        String input
                = "'E\u00E9 \u043c\u0430\u0442\u0435\u043c\u0430\u0442\u0438\u043a\u0430";
        // Transliterate, decompose, then remove everything except alphanumerics and spaces.
        String id1 = "Any-Latin; NFD; [^\\p{Alnum} ] Remove";
        String latin1 = com.ibm.icu.text.Transliterator.getInstance(id1)
                .transform(input);
        Assert.assertEquals("Ee matematika", latin1);
    }
but this fails (with ICU4J 54.1.1):
    junit.framework.ComparisonFailure: expected:<Ee[ ]matematika> but was:<Ee[]matematika>
        at junit.framework.Assert.assertEquals
I can call replaceAll in Java code with the same regex, and that does work:
    @Test
    public void test2() {
        // Mixed Latin and Cyrillic input; the Cyrillic word transliterates to "matematika".
        String input
                = "'E\u00E9 \u043c\u0430\u0442\u0435\u043c\u0430\u0442\u0438\u043a\u0430";
        String id1 = "Any-Latin; NFD; [^\\p{Alnum} ] Remove";
        String id2 = "Any-Latin; NFD";
        // Filtering inside the transliterator ID: the space is lost.
        String latin1 = com.ibm.icu.text.Transliterator.getInstance(id1)
                .transform(input);
        Assert.assertEquals("Eematematika", latin1); // why not "Ee matematika"?
        // Filtering afterwards with a Java regex: the space is kept.
        String latin2 = com.ibm.icu.text.Transliterator.getInstance(id2)
                .transform(input).replaceAll("[^\\p{Alnum} ]", "");
        Assert.assertEquals("Ee matematika", latin2);
    }
and so does replacing the space in the transliterator ID with \\x20. Is this just a bug in ICU4J, or is it somehow expected?
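To rule out the regex itself, the plain-Java half of the comparison can be isolated without ICU4J. This is only a minimal sketch: the class name RegexStep and the helper stripKeepingSpaces are made up for illustration, java.text.Normalizer stands in for the NFD step, and the input is already Latin, so no transliteration is involved.

```java
import java.text.Normalizer;

public class RegexStep {
    // Hypothetical helper: decompose accented characters (NFD), then drop
    // everything that is not an ASCII letter, an ASCII digit, or a literal space.
    // In java.util.regex, \p{Alnum} is ASCII-only by default, so the combining
    // accent left behind by NFD is removed as well.
    static String stripKeepingSpaces(String input) {
        String decomposed = Normalizer.normalize(input, Normalizer.Form.NFD);
        return decomposed.replaceAll("[^\\p{Alnum} ]", "");
    }

    public static void main(String[] args) {
        // The apostrophe and the accent are removed; the space survives.
        System.out.println(stripKeepingSpaces("'E\u00E9 matematika"));
    }
}
```

With java.util.regex, the literal space inside the character class is treated as a real member of the class, which is why the space survives here. The difference I am seeing therefore seems to be in how the same set expression is read inside the transliterator ID.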