Unicode moved your cheese
The character U+00A7 SECTION SIGN was changed from category So (Symbol, Other) to category Po (Punctuation, Other) in Unicode 6.1.0:
UnicodeData.txt
- U+00A7, U+00B6, U+0F14, U+1360, and U+10102 were changed from gc=So to gc=Po.
Java 7 uses Unicode 6.0.0 and Java 8 updates to Unicode 6.2.0, which explains the difference in the result. Since § now belongs to the Punctuation category, it is matched by \p{P} in Java 8.
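To see which side of the change a given runtime is on, you can ask it directly. This is only a sketch and not part of the original analysis: the class and method names are made up for illustration, and the answers depend on the Unicode version bundled with the JVM that runs it.

```java
import java.util.regex.Pattern;

public class SectionSignCategory {
    // U+00A7 SECTION SIGN, given as a code point so the source encoding does not matter.
    static final int SECTION_SIGN = 0x00A7;

    // True when the running JVM classifies the section sign as Po (Punctuation, Other).
    static boolean isOtherPunctuation() {
        return Character.getType(SECTION_SIGN) == Character.OTHER_PUNCTUATION;
    }

    // True when \p{P} matches the section sign on the running JVM.
    static boolean matchesPunctuationClass() {
        String sign = new String(Character.toChars(SECTION_SIGN));
        return Pattern.matches("\\p{P}", sign);
    }

    public static void main(String[] args) {
        System.out.println("java.version   = " + System.getProperty("java.version"));
        System.out.println("category is Po = " + isOtherPunctuation());
        System.out.println("\\p{P} matches  = " + matchesPunctuationClass());
    }
}
```

On Java 7 both checks should report false (the sign is still So there); on Java 8 and later both should report true.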
Wrong solution
Since regular punctuation characters like !, #, ", ... also belong to the Po category, we can't really remove this subcategory.
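A quick way to convince yourself of this is to test a few everyday characters against \p{Po}. The helper below is a hypothetical illustration, not code from the question:

```java
import java.util.regex.Pattern;

public class OtherPunctuationCheck {
    // Returns true when the whole string s (a single character here) is matched by \p{Po}.
    static boolean isPo(String s) {
        return Pattern.matches("\\p{Po}", s);
    }

    public static void main(String[] args) {
        // '!', '#' and '"' are Po; '-' is Pd (Punctuation, Dash) and is shown for contrast.
        for (String s : new String[] {"!", "#", "\"", "-"}) {
            System.out.println(s + " is Po: " + isPo(s));
        }
    }
}
```

Dropping Po from the allowed set would therefore strip !, # and " along with §.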
The next obvious solution is to use character set intersection to remove the unwanted character:
"[^a-zA-Z0-9\\p{P}\\s&&[^\u00a7]]"
... but wait a minute, there is a bug in Java with a negated character class nested inside a negated character class. The regex above compiles to:
[^a-zA-Z0-9\p{P}\s&&[^§]]
Start. Start unanchored match (minLength=1)
Pattern.intersection. S ∩ T:
  Pattern.setDifference. S ∖ T:
    Pattern.setDifference. S ∖ T:
      Pattern.setDifference. S ∖ T:
        Pattern.setDifference. S ∖ T:
          CharProperty.complement. S̄:
            Pattern.rangeFor. U+0061 <= codePoint <= U+007A.
          Pattern.rangeFor. U+0041 <= codePoint <= U+005A.
        Pattern.rangeFor. U+0030 <= codePoint <= U+0039.
      DEBUG charProp: java.util.regex.Pattern$Category
    Ctype. POSIX (US-ASCII): SPACE
  CharProperty.complement. S̄:
    BitClass. Match any of these 1 character(s):
      §
java.util.regex.Pattern$LastNode
Node. Accept match
... which resolves to [^a-zA-Z0-9\p{P}\s] intersected with [^§], instead of NOT ([a-zA-Z0-9\p{P}\s] intersected with [^§]).
Correct solution
To work around the bug above, use a union instead of an intersection:
"[[^a-zA-Z0-9\\p{P}\\s]\u00a7]"
which compiles to:
[[^a-zA-Z0-9\p{P}\s]§]
Start. Start unanchored match (minLength=1)
Pattern.union. S ∪ T:
  Pattern.setDifference. S ∖ T:
    Pattern.setDifference. S ∖ T:
      Pattern.setDifference. S ∖ T:
        Pattern.setDifference. S ∖ T:
          CharProperty.complement. S̄:
            Pattern.rangeFor. U+0061 <= codePoint <= U+007A.
          Pattern.rangeFor. U+0041 <= codePoint <= U+005A.
        Pattern.rangeFor. U+0030 <= codePoint <= U+0039.
      DEBUG charProp: java.util.regex.Pattern$Category
    Ctype. POSIX (US-ASCII): SPACE
  BitClass. Match any of these 1 character(s):
    §
java.util.regex.Pattern$LastNode
Node. Accept match
The § is correctly included in the character class this time, so the sign will be removed.
Note that I have removed the quantifier for demonstration purposes. Please add a quantifier back to the character class in your code, preferably the one-or-more quantifier +, instead of the zero-or-more quantifier used in the question.
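Putting the pieces together, here is a minimal sketch of the union-based class with the + quantifier restored. It assumes the goal is to delete every matched character with replaceAll; the class name, method name and sample input are invented for illustration.

```java
import java.util.regex.Pattern;

public class StripUnwanted {
    // Union of "anything that is not an ASCII letter, ASCII digit, punctuation or
    // whitespace" with the section sign. '+' makes each match consume at least one character.
    private static final Pattern UNWANTED =
            Pattern.compile("[[^a-zA-Z0-9\\p{P}\\s]\u00a7]+");

    // Deletes every run of unwanted characters from the input.
    static String strip(String input) {
        return UNWANTED.matcher(input).replaceAll("");
    }

    public static void main(String[] args) {
        // The section sign is removed; '!' (ordinary Po punctuation) is kept.
        System.out.println(strip("a\u00a7b! c"));
    }
}
```

Because § is added explicitly, the result does not depend on whether the running JVM classifies it as So or Po.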