
My question is simple yet puzzling. There may be a simple switch that fixes this, but I'm not very experienced with Java regexes.

String line = "\ud83d\udc95\ud83d\udc95\ud83d\udc95"; // three U+1F495 characters
line.replaceAll("(?i)(.)\\1{2,}", "$1");

This crashes. If I remove the (?i) switch, it works. The three Unicode characters are not random; they were found amidst a big Korean text, but I don't know whether they are valid.

The strange thing is that the regex works for all the other text except this. Why do I get the error?

This is the exception I get

Exception in thread "main" java.lang.StringIndexOutOfBoundsException: String index out of range: 6
    at java.lang.String.charAt(String.java:658)
    at java.lang.Character.codePointAt(Character.java:4668)
    at java.util.regex.Pattern$CIBackRef.match(Pattern.java:4846)
    at java.util.regex.Pattern$Curly.match(Pattern.java:4125)
    at java.util.regex.Pattern$GroupTail.match(Pattern.java:4615)
    at java.util.regex.Pattern$CharProperty.match(Pattern.java:3694)
    at java.util.regex.Pattern$GroupHead.match(Pattern.java:4556)
    at java.util.regex.Pattern$Start.match(Pattern.java:3408)
    at java.util.regex.Matcher.search(Matcher.java:1199)
    at java.util.regex.Matcher.find(Matcher.java:592)
    at java.util.regex.Matcher.replaceAll(Matcher.java:902)
    at java.lang.String.replaceAll(String.java:2162)
    at tokenizer.Test.main(Test.java:51)
Volker Stolz
binit

3 Answers

3

The characters you mentioned are actually "double-byte characters", which means that two bytes form one character. For Java to interpret them, the encoding information (when it differs from the default platform encoding) needs to be passed explicitly; otherwise the default platform encoding will be used.

To demonstrate this, consider the following:

String line = "\ud83d\udc95\ud83d\udc95\ud83d\udc95"; // three U+1F495 characters
System.out.println(line.length());

This prints the length as 6, whereas we only have three characters.

Now the following code

String line1 = new String("\ud83d\udc95\ud83d\udc95\ud83d\udc95".getBytes(),"UTF-8");
System.out.println(line1.length());

prints the length as 3, which is what was intended.

If you replace the line

String line = "\ud83d\udc95\ud83d\udc95\ud83d\udc95";

with

 String line1 = new String("\ud83d\udc95\ud83d\udc95\ud83d\udc95".getBytes(),"UTF-8");

it works and the regex does not fail. I have used UTF-8 here. Please use the appropriate encoding for your intended platform.

Java's regex libraries depend heavily on the character sequence, which in turn depends on the encoding scheme. For strings whose character encoding differs from the default encoding, characters cannot be decoded correctly (it showed 6 chars instead of 3!), and hence the regex fails.

Santosh
  • Hey Santosh, your fix is not working at my end. I tried: new String("\ud83d\udc95\ud83d\udc95\ud83d\udc95".getBytes(),"UTF-8").replaceAll("(?i)(.)\\1{2,}", "$1"); and it still crashes... also new String("\ud83d\udc95\ud83d\udc95\ud83d\udc95".getBytes(),"UTF-8").length() shows me 6 (you have mentioned 3)! – binit Apr 15 '13 at 10:49
  • On my machine (Win XP SP2, jdk1.6.0_14) it shows 3 chars. What OS/JDK are you using? Can you try a different encoding (e.g. UTF-16)? What is the default charset of your machine? – Santosh Apr 15 '13 at 11:03
  • `line1.length()` can only be `3` if your platform default encoding doesn't support the characters and thus encodes `?` in place of them. So you are seeing the length of the string `"???"`, don't know how that is intended. If your platform encoding is `UTF-8` you will get useless round-trip. – Esailija Apr 15 '13 at 11:14
  • `line1.length()=3` is only true for single byte chars. When I print the string without encoding it prints `??????`, i.e. one char for one byte. – Santosh Apr 15 '13 at 11:44
  • I also have Win XP SP2, jdk1.6.0_16. System's encoding found out by java.nio.charset.Charset.defaultCharset() is "UTF-8". The machine is 32-bit (if it matters). Changing to "UTF-16" works, it applies the regex successfully. But I have a restriction to use "UTF-8", in fact, it would be best if I could simply ignore the UTF-16 characters and apply the regex rather than throw an exception. – binit Apr 15 '13 at 12:02
  • @Esailija input is changed to `??????` because input has 6 bytes (_two bytes for each char. remember those are double byte char_). In absence of encoding, each byte is decoded to one char and hence you see six chars `??????`. – Santosh Apr 15 '13 at 13:13
  • @Esailija that's what I said, it has no point but it's being done nonetheless when you do not specify encoding. – Santosh Apr 15 '13 at 13:18
  • @Santosh but now you are saying there is no point whereas in your post you say that this is the intended result? What is intended about `??????`? – Esailija Apr 15 '13 at 13:19
  • @binit, if UTF-16 works, this means that's the char encoding your application needs. If you are compelled to use UTF-8 then you need to find similar chars which fall within the range of UTF-8 – Santosh Apr 15 '13 at 13:19
  • @Esailija Not sure what I missed, but by _intended_ I meant that when I try to find a length of a string having 3 character I should get length as 3. – Santosh Apr 15 '13 at 13:21
  • @Santosh but you only get 3 because your string contains garbage. The `3` is meaningless because it's from a garbage string. – Esailija Apr 15 '13 at 13:22
  • @Santosh for example, if your platform encoding is Shift-JIS, then `line1.length()` will be `3`, but what is the point if the string is just `???` I don't get it. You turned the string into useless garbage, and the garbage would get converted to `?` with the regex. Why would you want to do that? – Esailija Apr 15 '13 at 13:25
  • @Esailija, yes the decoded part is garbage (and no contention abt that) but bytes themselves are not garbage. OP mentions that its a 3 character string and it worked for him with UTF-16 encoding. – Santosh Apr 15 '13 at 13:34
  • @Santosh then op's definition of "working" is pretty bad, at best he will not get an exception when using the regex but he won't have usable results. If you just wanted to count actual characters, you could have used `Character.codePointCount` - no need to turn the string into garbage :) – Esailija Apr 15 '13 at 13:35
  • @Esailija Well, counting character was aimed at showing the effect of encoding (versus default encoding) thats all. – Santosh Apr 15 '13 at 13:38
  • Guys, just to clarify, by working I meant that it didn't throw exception. The other readable characters in the text are clearly UTF-8 and get jumbled in UTF-16 like Esailija said. The other reason is that I don't want to mix UTF-8 and UTF-16 so I am forced with UTF-8 (since I know other chars are UTF-8). – binit Apr 15 '13 at 13:56
  • @binit strings in Java are always UTF-16 and your string is correct to begin with - it's a bug with regular expression, nothing to do with encodings or `.getBytes()`. What I am trying with my comments is to make Santosh realize his answer is useless. When you talk about UTF-8 you mean storage or transmission encoding, but the in-memory strings in Java are always internally stored in UTF-16. – Esailija Apr 16 '13 at 05:46
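
The length discussion in the comments above can be checked without relying on any platform default. The following is only a minimal sketch: the class name `RoundTripDemo` and the helper `roundTrip` are made up for illustration, the string is assumed to be the three U+1F495 characters from the question, and every charset is passed explicitly.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

final class RoundTripDemo {
    // Three copies of U+1F495, each written as a UTF-16 surrogate pair.
    static final String LINE = "\ud83d\udc95\ud83d\udc95\ud83d\udc95";

    // Hypothetical helper: encode with one charset, decode the bytes with another.
    static String roundTrip(String s, Charset encodeWith, Charset decodeWith) {
        return new String(s.getBytes(encodeWith), decodeWith);
    }

    public static void main(String[] args) {
        // length() counts UTF-16 code units; codePointCount counts characters.
        System.out.println(LINE.length());                         // 6
        System.out.println(LINE.codePointCount(0, LINE.length())); // 3

        // UTF-8 can represent U+1F495, so this round trip changes nothing.
        String same = roundTrip(LINE, StandardCharsets.UTF_8, StandardCharsets.UTF_8);
        System.out.println(same.equals(LINE));                     // true

        // US-ASCII cannot, so each character becomes '?' while encoding.
        String lossy = roundTrip(LINE, StandardCharsets.US_ASCII, StandardCharsets.UTF_8);
        System.out.println(lossy + " " + lossy.length());          // ??? 3
    }
}
```

If this reading is right, a length of 3 after `getBytes()` comes from a lossy default encoding rather than from correct decoding, which matches what Esailija describes.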
1

What's explained by Santosh in this answer is incorrect. This can be demonstrated by running

String str = "\ud83d\udc95"; // one U+1F495 character
System.out.println("code point: " + str.codePointAt(0));

which will output (at least for me) the value 128149, which this page confirms is correct. So Java does not interpret the string incorrectly; it is only misinterpreted when it goes through the getBytes() method.
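
As a slightly fuller sketch of the same check (the class name `CodePointDemo` and the helper `describe` are made up for illustration, and the character is assumed to be U+1F495), the standard `Character` methods show how one code point maps to two `char` values:

```java
final class CodePointDemo {
    // One U+1F495 character, stored as a UTF-16 surrogate pair.
    static final String HEART = "\ud83d\udc95";

    // Hypothetical helper: describe the code point at the start of a string.
    static String describe(String s) {
        int cp = s.codePointAt(0);
        char[] units = Character.toChars(cp); // the UTF-16 code units for cp
        StringBuilder sb = new StringBuilder();
        sb.append("U+").append(Integer.toHexString(cp).toUpperCase());
        sb.append(" = ").append(cp);
        sb.append(", chars: ").append(Character.charCount(cp));
        for (char unit : units) {
            sb.append(' ').append(Integer.toHexString(unit));
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        System.out.println(describe(HEART)); // U+1F495 = 128149, chars: 2 d83d dc95
        System.out.println(Character.isSupplementaryCodePoint(HEART.codePointAt(0))); // true
    }
}
```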

However, as the OP explained, the regular expression crashes on it. I have no explanation for this other than a bug in Java. Either that, or Java doesn't fully support UTF-16 by design.

Edit:

based on this answer:

the regex compiler screws up on the UTF-16. Again, this can never be fixed or it will change old programs. You cannot even get around the bug by using the normal workaround to Java’s Unicode-in-source-code troubles by compiling with java -encoding UTF-8, because the stupid thing stores the strings as nasty UTF-16, which necessarily breaks them in character classes. OOPS!

It would seem that this is a limitation of regular expressions in java.


Since you commented that

it would be best if I could simply ignore the UTF-16 characters and apply the regex rather than throw an exception.

This can certainly be done. A straightforward way is to apply your regex only to a certain range. Filtering Unicode character ranges has been explained in this answer. Based on that answer, here is an example that doesn't seem to choke but just leaves the problem characters alone:

line.replaceAll("(?Ui)([\\u0000-\\uffff])\\1{2,}", "$1")    

// "💕💕💕" -> "💕💕💕"
// "foo 💕💕💕 foo" -> "foo 💕💕💕 foo"
// "foo aAa foo" -> "foo a foo"
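
For anyone who wants to try the three cases above directly, here is a small self-contained sketch. The class name `BmpOnlyCollapse` and the method `collapse` are made up for illustration; the pattern is the one from this answer, and the `U` flag assumes Java 7 or later.

```java
import java.util.regex.Pattern;

final class BmpOnlyCollapse {
    // Same pattern as above: only characters in the BMP range can start a run.
    static final Pattern RUNS = Pattern.compile("(?Ui)([\\u0000-\\uffff])\\1{2,}");

    // Hypothetical helper: collapse runs of three or more equal characters.
    static String collapse(String line) {
        return RUNS.matcher(line).replaceAll("$1");
    }

    public static void main(String[] args) {
        String hearts = "\ud83d\udc95\ud83d\udc95\ud83d\udc95"; // three U+1F495
        System.out.println(collapse(hearts).equals(hearts));      // true, left alone
        System.out.println(collapse("foo aAa foo"));              // foo a foo
    }
}
```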
eis
  • line.replaceAll("(?Ui)([\\u0000-\\uffff])\\1{2,}", "$1"); This seems to be the way to go, and bypass the bug. Thanks. – binit Apr 16 '13 at 06:37
  • @binit no problem. Actually, as additional information, [this link](http://www.oracle.com/technetwork/articles/javase/supplementary-142654.html) tells that java regex should be able to handle the supplementary characters, so I think this confirms you're dealing with a bug. – eis Apr 16 '13 at 10:00
1

Actually, it's just a bug.

This is what stack traces and open source are for.

When CIBackRef (for case-insensitive back reference) compares with the group, it doesn't bump the loop index correctly. This shows the fix:

        // Check each new char to make sure it matches what the group
        // referenced matched last time around
        int x = i;
        for (int index=0; index<groupSize; ) {
            int c1 = Character.codePointAt(seq, x);
            int c2 = Character.codePointAt(seq, j);
            if (c1 != c2) {
                if (doUnicodeCase) {
                    int cc1 = Character.toUpperCase(c1);
                    int cc2 = Character.toUpperCase(c2);
                    if (cc1 != cc2 &&
                        Character.toLowerCase(cc1) !=
                        Character.toLowerCase(cc2))
                        return false;
                } else {
                    if (ASCII.toLower(c1) != ASCII.toLower(c2))
                        return false;
                }
            }
            int n = Character.charCount(c1);
            x += n;
            index += n;  // was index++
            j += Character.charCount(c2);
        }

groupSize is the total charCount of the group. j is the index for the referenced group.
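
To see why the loop counter matters, here is a simplified stand-alone model of that loop. This is a sketch, not the JDK code: the class `BackRefLoopDemo` and the method `regionEquals` are made up for illustration, the case-folding branch is left out, and the `fixed` parameter only switches between the two ways of advancing `index`.

```java
final class BackRefLoopDemo {
    // Compare groupSize chars starting at i with the group starting at j.
    // fixed == false advances index by 1 per code point (the old behaviour);
    // fixed == true advances it by the number of chars consumed.
    static boolean regionEquals(CharSequence seq, int i, int j,
                                int groupSize, boolean fixed) {
        int x = i;
        for (int index = 0; index < groupSize; ) {
            int c1 = Character.codePointAt(seq, x);
            int c2 = Character.codePointAt(seq, j);
            if (c1 != c2) {
                return false;
            }
            int n = Character.charCount(c1);
            x += n;
            j += Character.charCount(c2);
            index += fixed ? n : 1;
        }
        return true;
    }

    public static void main(String[] args) {
        String line = "\ud83d\udc95\ud83d\udc95\ud83d\udc95"; // three U+1F495
        // The group is chars 0..1; compare it with the last pair at chars 4..5.
        System.out.println(regionEquals(line, 4, 0, 2, true)); // true
        try {
            regionEquals(line, 4, 0, 2, false);
        } catch (IndexOutOfBoundsException e) {
            // One code point consumed both chars, but index only reached 1,
            // so the loop ran again and read past the end at index 6.
            System.out.println("out of bounds: " + e.getMessage());
        }
    }
}
```

With `index++`, a two-char group needs two iterations, but a single supplementary code point already consumes both chars, so the second iteration reads at index 6 in a six-char string. That matches the index reported in the question's exception.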

The test

  //9ff0 9592 9ff0 9592 9ff0 9592
  val line = "\ud83d\udc95\ud83d\udc95\ud83d\udc95"
  Console println Try(line.replaceAll("(?ui)(.)\\1{2,}", "$1"))

fails normally

apm@mara:~/tmp$ skalac kcharex.scala ; skala kcharex.Test
Failure(java.lang.StringIndexOutOfBoundsException: String index out of range: 6)

but succeeds with the fix

apm@mara:~/tmp$ skala -J-Xbootclasspath/p:../bootfix kcharex.Test
Success(💕)

The other bug in the original sample code is that the inline flags should include ?ui. The javadoc on Pattern.CASE_INSENSITIVE says:

By default, case-insensitive matching assumes that only characters in the US-ASCII charset are being matched. Unicode-aware case-insensitive matching can be enabled by specifying the UNICODE_CASE flag in conjunction with this flag.

As you can see from the code snippet, without u, it will fail only if ASCII.toLower doesn't compare equal, which is not intended. I'm not sophisticated enough to know of a supplementary character that would fail that test without writing code to figure it out.
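
The difference between `(?i)` and `(?ui)` is easier to see with a non-ASCII letter from the BMP than with a supplementary character. The following is a minimal sketch: the class name `UnicodeCaseDemo`, the method `collapse`, and the sample input (the letter e with acute accent in lower, upper, lower case) are my own choices for illustration, not from the question.

```java
final class UnicodeCaseDemo {
    // Hypothetical helper: apply the question's regex with the given inline flags.
    static String collapse(String flags, String line) {
        return line.replaceAll("(?" + flags + ")(.)\\1{2,}", "$1");
    }

    public static void main(String[] args) {
        // Lower, upper, lower case of e with acute accent (U+00E9, U+00C9).
        String accented = "\u00e9\u00c9\u00e9";

        // ASCII letters fold with plain (?i).
        System.out.println(collapse("i", "aAa"));                      // a

        // Non-ASCII letters do not: the back reference sees no match.
        System.out.println(collapse("i", accented).equals(accented));  // true

        // With UNICODE_CASE the run is recognised and collapsed.
        System.out.println(collapse("ui", accented).equals("\u00e9")); // true
    }
}
```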

som-snytt