
I have the following code snippet:

import java.util.regex.Pattern;

public class EscaperDemo
{
    // Matches runs of characters that are not ASCII letters or digits, punctuation (\p{P}), or whitespace.
    private static final Pattern ESCAPER_PATTERN = Pattern.compile("[^a-zA-Z0-9\\p{P}\\s]*");

    public static void main(String[] args)
    {
        String unaccentedText = "Aa123 \\/*-+.=+:/;.,?u%µ£$*^¨-)ac!e§('\"e&€#²³~´][{^";
        System.out.println(ESCAPER_PATTERN.matcher(unaccentedText).replaceAll(""));
    }
}

When I execute this with JDK 7 the output I get is:

Aa123 \/*-.:/;.,?u%*-)ac!e('"e][{

When I execute the same with JDK 8 the output I get is:

Aa123 \/*-.:/;.,?u%*-)ac!e§('"e][{

Notice that the section sign § is not removed with JDK 8.

Which regex should I use with JDK 8 so that the section sign is matched (and removed) as well, and what is the reason for this difference in behaviour between the two JDKs?

Sean Bright
  • I don't see `\p{P}` in the [Java 8 documentation for `Pattern`](https://docs.oracle.com/javase/8/docs/api/java/util/regex/Pattern.html) - what is that supposed to match? – Sean Bright Jul 10 '15 at 12:56
  • @SeanBright: See https://docs.oracle.com/javase/8/docs/api/java/util/regex/Pattern.html#ucc and look for "General category" constants in https://docs.oracle.com/javase/8/docs/api/java/lang/Character.html – nhahtdh Jul 10 '15 at 12:57
  • @nhahtdh - I'm a bit thick and I'm not seeing just "P" anywhere. I'll take your word for it that if I read it more closely I would understand :-) – Sean Bright Jul 10 '15 at 13:01
  • @SeanBright: Ah, that doesn't seem to be clearly mentioned. `P` encompasses all categories whose shorthand starts with `P` (Punctuation). – nhahtdh Jul 10 '15 at 13:03
  • @nhahtdh - Got it. Thanks for the education. – Sean Bright Jul 10 '15 at 13:04

1 Answer


Unicode moved your cheese

The character U+00A7 SECTION SIGN was changed from category So (Symbol, Other) to category Po (Punctuation, Other) in Unicode 6.1.0:

  • UnicodeData.txt

    • U+00A7, U+00B6, U+0F14, U+1360, and U+10102 were changed from gc=So to gc=Po.

Java 7 uses Unicode 6.0.0 and Java 8 updates to Unicode 6.2.0, which explains the difference in the result. Because § now belongs to the Punctuation category, it is matched by \p{P} in Java 8, so the negated character class no longer matches it and it is not removed.
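The category change can be checked directly on whichever JDK is at hand. The following is a minimal sketch; the class and method names are illustrative only. `Character.getType` and `\p{P}` both rely on the Unicode data bundled with the running JDK, so the result depends on the Java version.

```java
import java.util.regex.Pattern;

class SectionSignCategory {
    /** True if the running JDK classifies the code point as Po (Punctuation, Other). */
    static boolean isOtherPunctuation(int codePoint) {
        return Character.getType(codePoint) == Character.OTHER_PUNCTUATION;
    }

    /** True if the running JDK's regex engine matches the code point with \p{P}. */
    static boolean matchesPunctuation(int codePoint) {
        return Pattern.matches("\\p{P}", new String(Character.toChars(codePoint)));
    }

    public static void main(String[] args) {
        int section = 0x00A7; // U+00A7 SECTION SIGN
        // Per the explanation above, both checks should be false on Java 7
        // and true on Java 8.
        System.out.println("Po category: " + isOtherPunctuation(section));
        System.out.println("\\p{P} match: " + matchesPunctuation(section));
    }
}
```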

Wrong solution

Since regular punctuation characters like !, #, ", ... also belong to the Po category, we can't simply exclude this subcategory from \p{P}.
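As a quick illustration of why excluding Po wholesale would be too broad, this small sketch (the class name and sample characters are my own choices) tests a few everyday characters against `\p{Po}`:

```java
import java.util.regex.Pattern;

class PoMembers {
    private static final Pattern OTHER_PUNCTUATION = Pattern.compile("\\p{Po}");

    /** True if the whole string is a single character in category Po. */
    static boolean isPo(String s) {
        return OTHER_PUNCTUATION.matcher(s).matches();
    }

    public static void main(String[] args) {
        // '!', '#', '"' and '.' share the Po subcategory; '-' (Pd) and '+' (Sm) do not.
        for (String s : new String[] {"!", "#", "\"", ".", "-", "+"}) {
            System.out.println(s + " -> " + isPo(s));
        }
    }
}
```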

The next obvious solution is to use character set intersection to remove the unwanted character:

"[^a-zA-Z0-9\\p{P}\\s&&[^\u00a7]]"

... but wait a minute: Java has a bug with a negated character class nested inside another negated character class. The regex above compiles to:

[^a-zA-Z0-9\p{P}\s&&[^§]]
Start. Start unanchored match (minLength=1)
Pattern.intersection. S ∩ T:
  Pattern.setDifference. S ∖ T:
    Pattern.setDifference. S ∖ T:
      Pattern.setDifference. S ∖ T:
        Pattern.setDifference. S ∖ T:
          CharProperty.complement. S̄:
            Pattern.rangeFor. U+0061 <= codePoint <= U+007A.
          Pattern.rangeFor. U+0041 <= codePoint <= U+005A.
        Pattern.rangeFor. U+0030 <= codePoint <= U+0039.
      DEBUG charProp: java.util.regex.Pattern$Category
    Ctype. POSIX (US-ASCII): SPACE
  CharProperty.complement. S̄:
    BitClass. Match any of these 1 character(s):
      §
java.util.regex.Pattern$LastNode
Node. Accept match

... which resolves to [^a-zA-Z0-9\p{P}\s] intersected with [^§], instead of the complement of ([a-zA-Z0-9\p{P}\s] intersected with [^§]). Since § is not a member of [^§], the resulting intersection can never match it.
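The two readings can be spelled out with plain boolean logic, which avoids depending on how any particular JDK parses the nested class. This is only a sketch: the class and method names are hypothetical, and the "keep" set S is matched with an ordinary non-negated character class.

```java
import java.util.regex.Pattern;

class IntersectionReadings {
    // S: the characters the question wants to keep.
    private static final Pattern KEEP = Pattern.compile("[a-zA-Z0-9\\p{P}\\s]");

    static boolean inKeepSet(String ch) {
        return KEEP.matcher(ch).matches();
    }

    /** Reading shown in the dump above: (not S) intersected with (not U+00A7). */
    static boolean complementThenIntersect(String ch) {
        return !inKeepSet(ch) && !ch.equals("\u00a7");
    }

    /** Intended reading: not (S intersected with (not U+00A7)). */
    static boolean intersectThenComplement(String ch) {
        return !(inKeepSet(ch) && !ch.equals("\u00a7"));
    }

    public static void main(String[] args) {
        // A letter, the euro sign, and the section sign.
        for (String ch : new String[] {"a", "\u20ac", "\u00a7"}) {
            System.out.println(ch + ": " + complementThenIntersect(ch)
                    + " / " + intersectThenComplement(ch));
        }
    }
}
```

Only the second reading ever reports the section sign as a character to remove; the first one excludes it by construction.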

Correct solution

To work around the bug above, the working solution is:

"[[^a-zA-Z0-9\\p{P}\\s]\u00a7]"

which compiles to:

[[^a-zA-Z0-9\p{P}\s]§]
Start. Start unanchored match (minLength=1)
Pattern.union. S ∪ T:
  Pattern.setDifference. S ∖ T:
    Pattern.setDifference. S ∖ T:
      Pattern.setDifference. S ∖ T:
        Pattern.setDifference. S ∖ T:
          CharProperty.complement. S̄:
            Pattern.rangeFor. U+0061 <= codePoint <= U+007A.
          Pattern.rangeFor. U+0041 <= codePoint <= U+005A.
        Pattern.rangeFor. U+0030 <= codePoint <= U+0039.
      DEBUG charProp: java.util.regex.Pattern$Category
    Ctype. POSIX (US-ASCII): SPACE
  BitClass. Match any of these 1 character(s):
    §
java.util.regex.Pattern$LastNode
Node. Accept match

The § is correctly included in the character class this time, so the sign will be removed.

Note that I have removed the quantifier for demonstration purposes. Please add a quantifier back to the character class in your code, preferably the one-or-more quantifier + rather than the zero-or-more quantifier * used in the question.
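Putting the pieces together, a minimal sketch of the union pattern with the + quantifier might look like this. The class name, the `strip` helper, and the sample input are assumptions made for illustration, not part of the original code.

```java
import java.util.regex.Pattern;

class SymbolStripper {
    // Union of "anything that is not an ASCII letter or digit, punctuation (\p{P}),
    // or whitespace" and U+00A7, with the one-or-more quantifier.
    private static final Pattern ESCAPER_PATTERN =
            Pattern.compile("[[^a-zA-Z0-9\\p{P}\\s]\u00a7]+");

    /** Removes every run of characters matched by ESCAPER_PATTERN. */
    static String strip(String text) {
        return ESCAPER_PATTERN.matcher(text).replaceAll("");
    }

    public static void main(String[] args) {
        // Sample input containing a section sign, a euro sign and a plus sign.
        System.out.println(strip("Art. \u00a7 5: 10\u20ac + tax!"));
    }
}
```

Listing § explicitly keeps the result independent of which Unicode category the running JDK assigns to it.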

nhahtdh