2

What regular expression could I use to detect a multi-byte string?

For example, here is an expression that detects a string in English:

Pattern p=Pattern.compile("[a-zA-Z/]");

Similarly, I want a pattern that matches multi-byte text such as:

コメント_1050_固-減価償却費

stema
ramoh
  • AFAIK, Java strings are UTF-16 internally (historically UCS-2), so in that sense all strings are multi-byte. You can write characters with a `code > 127` literally, just like Latin ones, e.g. `ン`, or as an escape: `\u30F3` – kirilloid Mar 29 '12 at 07:25

5 Answers

3

You may want to have a look at Unicode Support in Java

I think basically you want the Unicode property \p{L}. This would match any code point that has the property "letter".

So your regex could look like this

Pattern p=Pattern.compile("[\\p{L}/]");

I just replaced the character ranges a-zA-Z with \p{L}
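As a quick check of the \p{L} approach, here is a minimal sketch. The class and method names are made up for illustration, and the sample text is written with \u escapes so it does not depend on the source file encoding.

```java
import java.util.regex.Pattern;

class LetterPatternDemo {
    // Same character class as in the answer: any Unicode letter, or a slash.
    static final Pattern LETTER_OR_SLASH = Pattern.compile("[\\p{L}/]");

    // Hypothetical helper: true if the text contains at least one letter or slash.
    static boolean containsLetterOrSlash(String s) {
        return LETTER_OR_SLASH.matcher(s).find();
    }

    public static void main(String[] args) {
        String katakana = "\u30B3\u30E1\u30F3\u30C8"; // コメント
        System.out.println(containsLetterOrSlash(katakana)); // true
        System.out.println(containsLetterOrSlash("1050"));   // false: digits only
    }
}
```

Note that \p{L} also matches plain ASCII letters, so this tells you "contains a letter", not "contains a non-ASCII character".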


Since Java 7 you could also use Pattern.UNICODE_CHARACTER_CLASS

Enables the Unicode version of Predefined character classes and POSIX character classes.

That turns the predefined \w into its Unicode version, meaning it matches all Unicode letters and digits (and connector characters like _).

So to match your string コメント_1050_固-減価償却費, which also contains a hyphen, you could use

Pattern p=Pattern.compile("^[\\w-]+$", Pattern.UNICODE_CHARACTER_CLASS);

This would match any string consisting of letters, digits, _ and -. (\w on its own does not cover the hyphen.)
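To see what the flag changes, here is a small sketch comparing \w with and without Pattern.UNICODE_CHARACTER_CLASS. The class name and sample strings are only illustrative; the sample is a shortened version of the string from the question, written with \u escapes.

```java
import java.util.regex.Pattern;

class UnicodeWordDemo {
    // Without the flag, \w only covers [a-zA-Z_0-9].
    static final Pattern ASCII_WORD = Pattern.compile("^\\w+$");
    // With the flag, \w also covers Unicode letters and digits.
    static final Pattern UNICODE_WORD =
            Pattern.compile("^\\w+$", Pattern.UNICODE_CHARACTER_CLASS);

    public static void main(String[] args) {
        String sample = "\u30B3\u30E1\u30F3\u30C8_1050"; // コメント_1050
        System.out.println(ASCII_WORD.matcher(sample).matches());   // false
        System.out.println(UNICODE_WORD.matcher(sample).matches()); // true
        // A hyphen is not a word character, flag or no flag.
        System.out.println(UNICODE_WORD.matcher(sample + "-x").matches()); // false
    }
}
```

If your strings can contain other punctuation, it has to be added to the character class explicitly.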

See here for more details

and here on regular-expressions.info for an overview of the Unicode scripts, properties and blocks.

See here a famous answer from tchrist about the caveats of regex in Java, including an update on what has changed with Java 7 (or will change in Java 8).

stema
  • But I tried the following code with JRE 8 and it still gives the wrong result: String input = ""; Pattern p = Pattern.compile("[\\u2A775]", Pattern.UNICODE_CHARACTER_CLASS); System.out.println("In Range :" + p.matcher(input).find()); – Daniel Yang Jul 07 '16 at 03:25
2

If you want to detect whether you have a multi-byte string, you can compare its length in characters with its length in bytes:

if (text.length() != text.getBytes(encoding).length)

This is true whenever the text contains a character that takes more than one byte in the given encoding, and it works for any encoding.
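Wrapped up as a method, the comparison could look like the sketch below. The class and method names are my own, and UTF-8 is just the encoding picked for the demo.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

class ByteLengthCheck {
    // Hypothetical helper: true if the encoded form has a different number of
    // bytes than the string has chars, i.e. some character is not one byte.
    static boolean hasMultiByteChar(String text, Charset encoding) {
        return text.length() != text.getBytes(encoding).length;
    }

    public static void main(String[] args) {
        System.out.println(hasMultiByteChar("abc", StandardCharsets.UTF_8));    // false
        System.out.println(hasMultiByteChar("\u30B3", StandardCharsets.UTF_8)); // true: コ is 3 bytes
    }
}
```

One thing to keep in mind: with a fixed-width multi-byte encoding such as UTF-16LE, even plain ASCII text reports true, because every character takes two bytes there.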

Peter Lawrey
1

Essentially, Java regular expressions work on Strings, not arrays of bytes - characters are represented as abstract "character" entities, not as bytes in some specific encoding. This is not completely true since the char type only contains characters from the Basic Multilingual Plane and Unicode chars from outside this range are represented as two char values each, but nonetheless "multibyte" is relative and depends on the encoding.

If what you need is "multibyte in UTF-8", then note that only characters with values 0-127 are single-byte in this encoding. So, the easiest way to check would be to use a loop and check each character - if it's greater than 127, it's more than one byte in UTF-8.
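The loop described above might look like this. It is only a sketch with made-up names, and it assumes "multibyte" means "more than one byte in UTF-8".

```java
class Utf8Scan {
    // Hypothetical helper: true if any char is above 127, which means it
    // needs more than one byte when encoded as UTF-8.
    static boolean hasMultiByteUtf8(String s) {
        for (int i = 0; i < s.length(); i++) {
            if (s.charAt(i) > 127) {
                return true;
            }
        }
        return false;
    }

    public static void main(String[] args) {
        System.out.println(hasMultiByteUtf8("plain ascii")); // false
        System.out.println(hasMultiByteUtf8("caf\u00E9"));   // true: é is 2 bytes in UTF-8
    }
}
```

Surrogate halves are also above 127, so characters outside the Basic Multilingual Plane are caught by the same test.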

If you insist on using a regex, you could probably use the character range operator in the regex like this: [\u0080-\uFFFF] (haven't checked and \uFFFF is not really a character but I think the regex engine should accept it).
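For what it's worth, here is a sketch of that range as a Java pattern. The names are illustrative, and I only exercise it with characters from the Basic Multilingual Plane.

```java
import java.util.regex.Pattern;

class NonAsciiRegexDemo {
    // The backslashes are doubled so the regex engine, not the Java compiler,
    // interprets the \\uXXXX escapes.
    static final Pattern NON_ASCII = Pattern.compile("[\\u0080-\\uFFFF]");

    // Hypothetical helper: true if the text contains a char in U+0080..U+FFFF.
    static boolean containsNonAscii(String s) {
        return NON_ASCII.matcher(s).find();
    }

    public static void main(String[] args) {
        System.out.println(containsNonAscii("abc_1050"));          // false
        System.out.println(containsNonAscii("\u30B3\u30E1_1050")); // true
    }
}
```

If characters beyond U+FFFF matter to you, a negated class such as [^\x00-\x7F] avoids having to pick an upper bound at all.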

Michał Kosmulski
0

You will need to use Unicode for elements which are not in the English language. This link should provide you with some information.

npinti
0

There is a nice introduction to Unicode regex here.

David Brabant