13

I'm trying to convert Cyrillic words to Latin so I can use them in URLs. I use the icu4j Transliterator, but it still produces non-ASCII characters, for example: Vilʹândimaa. I want something more like viljandimaa. When I copy that URL, these letters turn into useless percent-encoded sequences (%..).

Does anybody know how to get Cyrillic to a-z with icu4j?

UPDATE

I can't answer my own question yet, but I found this question, which was very helpful: Converting Symbols, Accent Letters to English Alphabet

beaver
ivar

3 Answers

15

Modify your identifier to do what you want. You can strip unwanted characters using a regular expression with the Remove transform.

For example, consider the string "'Eé математика":

"'E\u00E9 \u043c\u0430\u0442\u0435\u043c\u0430\u0442\u0438\u043a\u0430"

The identifier "Any-Latin; NFD; [^\\p{Alnum}] Remove" will transliterate to Latin (which may still include accents), decompose accented characters into the letter and diacritics and remove anything that isn't an alphanumeric. The resultant string is "Eematematika".

You can read more on the identifiers under General Transforms on the ICU website.


Example:

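// import com.ibm.icu.text.Transliterator;
// Sample input: an apostrophe, "E", an accented "e", a space and a Cyrillic word
String source
       = "'E\u00E9 \u043c\u0430\u0442\u0435\u043c\u0430\u0442\u0438\u043a\u0430";
// Transliterate to Latin, decompose accents, then remove non-alphanumerics
String id = "Any-Latin; NFD; [^\\p{Alnum}] Remove";
String latin = Transliterator.getInstance(id)
                             .transform(source);
System.out.println(latin);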

Tested against ICU4J 49.1.

McDowell
  • Thanks McDowell - could you give a really quick example? – Nic Cottrell Apr 05 '12 at 07:46
  • @Nicholas Tolley Cottrell - Example added. – McDowell Apr 07 '12 at 18:36
  • Thanks again McDowell - I ended up using "Any-Latin; NFD" since I wanted to preserve spaces. – Nic Cottrell Apr 17 '12 at 09:40
  • 1
    @NicholasTolleyCottrell - That will leave the diacritics intact (accents on Latin letters.) The point of the NFD transform is to separate accents and letters into two consecutive code points. If you want to preserve spaces, modify the regular expression in the `Remove` transformation. – McDowell Apr 17 '12 at 18:25
  • 1
    Another, perhaps cleaner, take would be to use `Any-Latin; Lower; Latin-ASCII` instead of NFD with manual filtering — that converts to ASCII-only as much as possible explicitly. – Václav Slavík Nov 03 '15 at 16:26
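The NFD step discussed in these comments can be seen in isolation with the JDK's own `java.text.Normalizer`, without ICU4J. This is only a sketch of the decompose-then-filter idea: the class and method names are hypothetical, and it does not transliterate Cyrillic (that part still needs the ICU4J transform). Note also that in `java.util.regex` the `\p{Alnum}` class is ASCII-only by default, unlike in ICU, so any remaining non-ASCII letters are dropped here.

```java
import java.text.Normalizer;

// Illustrative helper; the class and method names are hypothetical.
final class AccentStripper {

    // NFD splits an accented letter such as e-acute (U+00E9) into the base
    // letter "e" (U+0065) followed by a combining acute accent (U+0301).
    static String decompose(String input) {
        return Normalizer.normalize(input, Normalizer.Form.NFD);
    }

    // After decomposition, remove everything that is neither an ASCII
    // letter or digit nor a plain space, so spaces are preserved.
    static String stripKeepingSpaces(String input) {
        return decompose(input).replaceAll("[^\\p{Alnum} ]", "");
    }

    public static void main(String[] args) {
        // One precomposed character becomes two code points after NFD.
        System.out.println(decompose("\u00E9").length());
        // The apostrophe and the combining accents are removed, the space stays.
        System.out.println(stripKeepingSpaces("'E\u00E9 caf\u00E9"));
    }
}
```

With ICU4J, the equivalent change would be to widen the set in front of `Remove` so that it also excludes spaces from removal.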
0

Have a look at: https://ru.stackoverflow.com/questions/633355/Показать-правильный-пример-транслитерации-на-java

Add the dependency:

<dependency>
    <groupId>com.ibm.icu</groupId>
    <artifactId>icu4j</artifactId>
    <version>63.1</version>
</dependency>

And transliterate:

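// import com.ibm.icu.text.Transliterator;
// "Russian-Latin/BGN" converts Cyrillic to Latin; "Latin-Russian/BGN" is the reverse direction.
var CYRILLIC_TO_LATIN = "Russian-Latin/BGN";
// var LATIN_TO_CYRILLIC = "Latin-Russian/BGN";
Transliterator toLatinTrans = Transliterator.getInstance(CYRILLIC_TO_LATIN);
String result = toLatinTrans.transliterate(st); // st is the input string to transliterate
System.out.println(result);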
Grigory Kislin
-1

I have no idea about icu4j, but in the Unicode table Cyrillic takes up only a small range. Instead of relying on third-party libraries with unclear workings, I'd define a transliteration sequence for each Cyrillic symbol and do the transliteration myself.
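A minimal sketch of that hand-rolled approach in plain Java, with no ICU4J dependency. The class name `SimpleCyrillicToLatin` and the letter mapping are assumptions made for illustration, not a standard scheme: the table covers only the lowercase Russian letters (U+0430 to U+044F, plus U+0451) and uses j-based spellings, chosen so that the Cyrillic spelling presumably behind the question's example comes out as viljandimaa.

```java
import java.util.Locale;

// Hypothetical hand-rolled transliterator; the mapping is one illustrative
// choice and covers the Russian alphabet only.
final class SimpleCyrillicToLatin {

    // Latin replacements for U+0430 (a) through U+044F (ya), in code point order.
    // The hard sign and soft sign map to the empty string.
    private static final String[] LATIN = {
        "a", "b", "v", "g", "d", "e", "zh", "z", "i", "j", "k", "l", "m", "n", "o", "p",
        "r", "s", "t", "u", "f", "h", "c", "ch", "sh", "shch", "", "y", "", "e", "ju", "ja"
    };

    static String transliterate(String input) {
        StringBuilder out = new StringBuilder();
        for (char c : input.toLowerCase(Locale.ROOT).toCharArray()) {
            if (c >= '\u0430' && c <= '\u044F') {
                out.append(LATIN[c - '\u0430']);   // table lookup by offset
            } else if (c == '\u0451') {
                out.append("jo");                  // the letter yo sits outside the main range
            } else if ((c >= 'a' && c <= 'z') || (c >= '0' && c <= '9')) {
                out.append(c);                     // keep characters that are already URL-safe
            }
            // Anything else (spaces, punctuation, other scripts) is dropped.
        }
        return out.toString();
    }

    public static void main(String[] args) {
        // Input is the Cyrillic word for "mathematics", written with escapes.
        System.out.println(transliterate(
            "\u043c\u0430\u0442\u0435\u043c\u0430\u0442\u0438\u043a\u0430"));
    }
}
```

Other Cyrillic alphabets (Ukrainian, Serbian and so on) contain letters outside this table, and those would be silently dropped unless more entries are added.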

P.S. What language does the word "viljandimaa" come from? It doesn't sound like Cyrillic to me...

Vladimir Dyuzhev