How to replace all non-supplementary chars, but leave all supplementary as is?

Asked Dec 04 '17 at 15:52

Active Dec 04 '17 at 15:55

Viewed 105 times

I have a problem with a Java regex applied to supplementary characters:

String x = new StringBuilder().appendCodePoint(0x10001).toString();
// x == "𐀁" (char['\uD800', '\uDC01']) - ok
String y = x.replaceAll("[\\x{10000}-\\x{10010}]", "*");
// y == "*" (char['*']) - ok
String z = x.replaceAll("[^\\x{10000}-\\x{10010}]", "*");
// z == "�*" (char['\uD800', '*']) - NOT ok

I expected z to be equal to x, since the only code point in x lies inside the excluded range. What am I doing wrong? I'm using jdk1.8.0_144.

edited Dec 04 '17 at 15:55

asked Dec 04 '17 at 15:52

nEraquasAr


Potentially try looking at [Java regex for support Unicode?](https://stackoverflow.com/questions/10894122/java-regex-for-support-unicode) – phflack Dec 04 '17 at 15:58
@BeeOnRope There is no anchor; it's `[^` not `^[`. The second regex is "match any character the codepoint of which is not within the range `10000-10010`". – HTNW Dec 04 '17 at 17:23

Oops duh! Looks like a bug in Unicode handling with character classes. Apparently the regex spuriously matches the second half of the surrogate pair and replaces it. It's consistent with incorrectly incrementing by "code unit" rather than "code point" along that code path. – BeeOnRope Dec 04 '17 at 17:27
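
If the cause really is that matching advances by code unit rather than by code point, as the comment above suggests, one way to sidestep the regex engine is to walk the string by code point yourself. This is only a minimal sketch, not a fix for the regex: `KeepSupplementary` and `maskOutsideRange` are hypothetical names made up for illustration, and the range bounds are the ones from the question.

```java
class KeepSupplementary {

    // Hypothetical helper: keeps every code point in [lo, hi] and
    // replaces every other code point (including unpaired surrogates) with '*'.
    static String maskOutsideRange(String input, int lo, int hi) {
        StringBuilder out = new StringBuilder(input.length());
        for (int i = 0; i < input.length(); ) {
            // Reads a full surrogate pair as one code point when present.
            int cp = input.codePointAt(i);
            if (cp >= lo && cp <= hi) {
                out.appendCodePoint(cp);
            } else {
                out.append('*');
            }
            // Advance by 1 or 2 chars, so the low surrogate is never visited alone.
            i += Character.charCount(cp);
        }
        return out.toString();
    }

    public static void main(String[] args) {
        String x = new StringBuilder().appendCodePoint(0x10001).toString();
        String z = maskOutsideRange(x, 0x10000, 0x10010);
        System.out.println(z.equals(x)); // true

        String mixed = maskOutsideRange("a" + x + "b", 0x10000, 0x10010);
        System.out.println(mixed.equals("*" + x + "*")); // true
    }
}
```

To keep all supplementary characters rather than one specific range, the same helper could be called with `Character.MIN_SUPPLEMENTARY_CODE_POINT` and `Character.MAX_CODE_POINT` as the bounds.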
