What regex can match similar characters?

Question

What regex could match similar characters, like (ä and a) or, in Russian, (и and й)? My code is below.

String text1 = " Passagiere noch auf ihr fehlendes Gepäck";
String text2 = " Passagiere noch auf ihr fehlendes Gepack";

Pattern p1 = Pattern.compile("\\b" + "Gepack");
Pattern p2 = Pattern.compile("\\b" + "Gepack");

Matcher m1 = p1.matcher(text1); // finds no occurrence
Matcher m2 = p2.matcher(text2); // finds one occurrence

Not sure this is the right duplicate as the linked to article is more about transliteration than normalisation. — JGNI, Mar 07 '19 at 14:41

score 1 · Accepted Answer · answered Mar 07 '19 at 14:28

You could build up a character class of all the characters you want to match, so you could replace pattern one with:

Pattern p1 = Pattern.compile("\\b" + "Gep[aä]ck");
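
As a quick check, here is a minimal sketch that runs that character-class pattern against both strings from the question. The class name CharClassDemo and the helper found are made up for illustration; only Pattern and Matcher come from the JDK.

```java
import java.util.regex.Pattern;

final class CharClassDemo {
    // The pattern from the answer: the class [aä] accepts either letter.
    private static final Pattern GEPAECK = Pattern.compile("\\b" + "Gep[a\u00e4]ck");

    // Hypothetical helper: true if the pattern occurs anywhere in the text.
    static boolean found(String text) {
        return GEPAECK.matcher(text).find();
    }

    public static void main(String[] args) {
        String text1 = " Passagiere noch auf ihr fehlendes Gep\u00e4ck"; // "...Gepäck"
        String text2 = " Passagiere noch auf ihr fehlendes Gepack";

        System.out.println(found(text1)); // accented spelling
        System.out.println(found(text2)); // unaccented spelling
    }
}
```

Both calls should report a match. Note that the class only covers the precomposed ä (U+00E4); text that stores the letter as a followed by a combining diaeresis (U+0308) would not match.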

But this could get very burdensome very quickly.

There is a mechanism in Unicode called Normalisation (see here for details) that lets you rewrite your string into a standard form so that it can be compared in different ways.

Normalisation Form Canonical Decomposition (NFD) takes a string containing accented character code points and splits each of them into multiple code points: the base character first, followed by the code points corresponding to the combining-character versions of the accents, in a well-defined order for each accented character.

Having done this to your input, you can use a regex to remove all the accents from the string, as they all have the Unicode property Mark, sometimes shortened to M (written \p{M} in a Java regex).

This gives you a string containing only base characters that your regex will match against.
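
The steps above can be sketched in Java roughly as follows. This is an illustration under assumptions, not the only way to do it: the class name AccentFolding and the helper stripAccents are made up here, while java.text.Normalizer and the \p{M} property are standard JDK features.

```java
import java.text.Normalizer;
import java.util.regex.Pattern;

final class AccentFolding {
    // Any code point whose Unicode general category is Mark (M).
    private static final Pattern MARKS = Pattern.compile("\\p{M}+");

    // Hypothetical helper: decompose to NFD, then delete the combining marks.
    static String stripAccents(String input) {
        String decomposed = Normalizer.normalize(input, Normalizer.Form.NFD);
        return MARKS.matcher(decomposed).replaceAll("");
    }

    public static void main(String[] args) {
        String text1 = " Passagiere noch auf ihr fehlendes Gep\u00e4ck"; // "...Gepäck"
        Pattern p1 = Pattern.compile("\\b" + "Gepack");

        System.out.println(p1.matcher(text1).find());                // pattern on the raw text
        System.out.println(p1.matcher(stripAccents(text1)).find());  // pattern on the stripped text
        System.out.println(stripAccents("\u0439").equals("\u0438")); // й reduces to и
    }
}
```

The first line should print false and the other two true. Keep in mind that this only helps for characters that have a canonical decomposition; letters such as ø or ß do not decompose into a base letter plus a mark, so they pass through unchanged.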
