You may want to have a look at Unicode Support in Java.
I think what you basically want is the Unicode property \p{L}. This matches any code point that has the property "letter".
So your regex could look like this:
Pattern p = Pattern.compile("[\\p{L}/]");
I just replaced the character ranges a-zA-Z with \p{L}.
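As a rough illustration of how \p{L} behaves, here is a minimal sketch. The class name UnicodeLetterCheck and the helper isLettersOrSlash are made up for this example, and the + quantifier plus the use of matches() are my additions so that the whole input is checked rather than a single character.

```java
import java.util.regex.Pattern;

final class UnicodeLetterCheck {
    // One or more characters, each either a Unicode letter or a slash.
    // The '+' quantifier is an addition for this sketch.
    private static final Pattern LETTERS_OR_SLASH = Pattern.compile("[\\p{L}/]+");

    // Hypothetical helper: true if the whole input consists of letters and '/'.
    static boolean isLettersOrSlash(String input) {
        return LETTERS_OR_SLASH.matcher(input).matches();
    }

    public static void main(String[] args) {
        // Four katakana letters, written as Unicode escapes so the source
        // file compiles regardless of its encoding.
        String katakana = "\u30B3\u30E1\u30F3\u30C8";

        System.out.println(isLettersOrSlash("abc/def"));  // true
        System.out.println(isLettersOrSlash(katakana));   // true
        System.out.println(isLettersOrSlash("abc_1050")); // false: '_' and digits are not letters
    }
}
```

Digits and the underscore are rejected here because \p{L} covers letters only.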
Since Java 7 you can also use the flag Pattern.UNICODE_CHARACTER_CLASS, which the documentation describes as follows:
"Enables the Unicode version of Predefined character classes and POSIX character classes."
That turns the predefined \w into its Unicode version, meaning it matches all Unicode letters and digits (and connector punctuation such as _).
So to match your string コメント_1050_固-減価償却費, you could use
Pattern p = Pattern.compile("^[\\w-]+$", Pattern.UNICODE_CHARACTER_CLASS);
This matches any string consisting of letters, digits, _ and -. The hyphen is listed explicitly because \w does not cover it, even in its Unicode version.
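To see what the flag changes, here is a small sketch that compares \w with and without Pattern.UNICODE_CHARACTER_CLASS. The class name UnicodeWordCheck, the constant names and the helper matchesFully are hypothetical. The sample string is written with Unicode escapes, and I am assuming the separator in it is a plain ASCII hyphen. Since \w does not cover the hyphen, the last pattern adds it to the character class.

```java
import java.util.regex.Pattern;

final class UnicodeWordCheck {
    // Default behaviour: the word class covers only ASCII letters, digits and '_'.
    static final Pattern ASCII_WORD = Pattern.compile("^\\w+$");

    // Same pattern, but the flag switches the word class to its Unicode version.
    static final Pattern UNICODE_WORD =
            Pattern.compile("^\\w+$", Pattern.UNICODE_CHARACTER_CLASS);

    // Variant that additionally allows '-', which the word class never includes.
    static final Pattern UNICODE_WORD_OR_HYPHEN =
            Pattern.compile("^[\\w-]+$", Pattern.UNICODE_CHARACTER_CLASS);

    // Hypothetical helper: true if the pattern matches the entire input.
    static boolean matchesFully(Pattern pattern, String input) {
        return pattern.matcher(input).matches();
    }

    public static void main(String[] args) {
        // Sample strings as Unicode escapes so the source compiles in any encoding.
        String withoutHyphen = "\u30B3\u30E1\u30F3\u30C8_1050";
        String withHyphen = withoutHyphen + "_\u56FA-\u6E1B\u4FA1\u511F\u5374\u8CBB";

        System.out.println(matchesFully(ASCII_WORD, withoutHyphen));          // false
        System.out.println(matchesFully(UNICODE_WORD, withoutHyphen));        // true
        System.out.println(matchesFully(UNICODE_WORD, withHyphen));           // false: '-' is not a word character
        System.out.println(matchesFully(UNICODE_WORD_OR_HYPHEN, withHyphen)); // true
    }
}
```

The same flag can also be switched on inside the pattern itself with the embedded flag expression (?U).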
See here for more details, and here on regular-expressions.info for an overview of the Unicode scripts, properties and blocks.
See also a famous answer from tchrist about the caveats of regex in Java, including an update on what has changed with Java 7 (or will change in Java 8).