
I have a problem with a Java regex applied to supplementary characters:

String x = new StringBuilder().appendCodePoint(0x10001).toString();
// x == "𐀁" (char['\uD800', '\uDC01']) - ok
String y = x.replaceAll("[\\x{10000}-\\x{10010}]", "*");
// y == "*" (char['*']) - ok
String z = x.replaceAll("[^\\x{10000}-\\x{10010}]", "*");
// z == "�*" (char['\uD800', '*']) - NOT ok

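I expect z to have the same contents as x, because the only code point in x (U+10001) lies inside the negated range and so should not be replaced. What am I doing wrong? I'm on jdk1.8.0_144.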
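
To make the `char[...]` annotations above easier to reproduce, here is a minimal sketch that prints the UTF-16 code units of `x` and compares the code-unit count with the code-point count. The class name `SurrogateInspect` and the helper `codeUnits` are hypothetical names introduced only for this illustration; the sketch does not call the regex engine at all.

```java
final class SurrogateInspect {
    // Hypothetical helper: renders each UTF-16 code unit of s as four hex digits,
    // separated by single spaces.
    static String codeUnits(String s) {
        StringBuilder sb = new StringBuilder();
        for (int i = 0; i < s.length(); i++) {
            if (i > 0) {
                sb.append(' ');
            }
            sb.append(String.format("%04X", (int) s.charAt(i)));
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        String x = new StringBuilder().appendCodePoint(0x10001).toString();
        System.out.println(codeUnits(x));                    // D800 DC01
        System.out.println(x.length());                      // 2 (code units)
        System.out.println(x.codePointCount(0, x.length())); // 1 (code point)
    }
}
```

Running the same helper on `z` shows whether the low surrogate was replaced on its own.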

nEraquasAr
  • Potentially try looking at [Java regex for support Unicode?](https://stackoverflow.com/questions/10894122/java-regex-for-support-unicode) – phflack Dec 04 '17 at 15:58
  • @BeeOnRope There is no anchor; it's `[^` not `^[`. The second regex is "match any character the codepoint of which is not within the range `10000-10010`". – HTNW Dec 04 '17 at 17:23
  • Oops duh! Looks like a bug in Unicode handling with character classes. Apparently the regex spuriously matches the second half of the surrogate pair and replaces it. It's consistent with incorrectly incrementing by "code unit" rather than "code point" along that code path. – BeeOnRope Dec 04 '17 at 17:27
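
If the cause really is code-unit stepping as the last comment suggests, one possible workaround is to skip the regex and walk the string by code point. The following is only a sketch under that assumption, not a fix for the regex itself; the class name `CodePointReplace` and the method `replaceOutsideRange` are hypothetical. Note that, unlike `replaceAll`, it appends the replacement literally and does not interpret `$` group references.

```java
final class CodePointReplace {
    // Hypothetical helper: replaces every code point outside [lo, hi]
    // with the given replacement, iterating by code point rather than by char.
    static String replaceOutsideRange(String s, int lo, int hi, String replacement) {
        StringBuilder out = new StringBuilder(s.length());
        s.codePoints().forEach(cp -> {
            if (cp >= lo && cp <= hi) {
                out.appendCodePoint(cp); // in range: keep the whole code point
            } else {
                out.append(replacement); // out of range: replace it
            }
        });
        return out.toString();
    }

    public static void main(String[] args) {
        String x = new StringBuilder().appendCodePoint(0x10001).toString();
        String z = replaceOutsideRange(x, 0x10000, 0x10010, "*");
        System.out.println(z.equals(x)); // true: the surrogate pair stays intact

        String w = replaceOutsideRange("a" + x + "b", 0x10000, 0x10010, "*");
        System.out.println(w.equals("*" + x + "*")); // true
    }
}
```

`String.codePoints()` is available from Java 8 on, so this should work on the JDK mentioned in the question.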

0 Answers