13

I have a localized input field. I need to validate, using a regex, that it accepts only letters and digits. I could have used `[a-z0-9]` if I were dealing with English only.

At the moment I am using `Character.isLetterOrDigit(name.charAt(i))` (yes, I am iterating through each character) to accept the letters of the various languages.

Are there any better ways of doing it? Any regex or other libraries available for this?

stema
ManuPK
  • So you want to handle also languages other than English, right? – Lukasz Feb 29 '12 at 13:18
  • looking for a generic solution **including English** – ManuPK Feb 29 '12 at 13:20
  • According to [this](http://stackoverflow.com/questions/2392194/how-to-match-the-international-alphabet-english-a-z-non-english-with-a-regu) post, `\w` also works on Unicode characters in Perl's regular expressions; I don't know whether that is so in Java regexes. – user1227804 Feb 29 '12 at 13:25
  • `\w` is `A word character: [a-zA-Z_0-9]`. So, no. – beerbajay Feb 29 '12 at 13:31
  • @beerbajay This is not completely true anymore. It's still the default, but `Pattern.UNICODE_CHARACTER_CLASS` enables the Unicode version of the predefined character classes and POSIX character classes. – stema Feb 29 '12 at 14:04
  • @stema Good tip, thanks! – beerbajay Feb 29 '12 at 14:11
  • @ManuPK Please note that using `charAt` in Java is always wrong. You should be calling `codePointAt`, and adjusting your `i` accordingly. – tchrist Feb 29 '12 at 15:27
  • @tchrist Point taken. Thank you. – ManuPK Feb 29 '12 at 15:32
  • I must point out that you used the term "alphabet". I believe what you really meant is script. BTW, please be aware that the regular expressions mentioned in the answers capture all numerals, including [Roman Numerals](http://en.wikipedia.org/wiki/Roman_numerals). You might also want to read about [Unicode Regular Expressions](http://unicode.org/reports/tr18/). – Paweł Dyda Feb 29 '12 at 19:25
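
The `codePointAt` remark in the comments above can be sketched as follows. This is a minimal illustration rather than the asker's actual code: the class and method names are hypothetical, and it assumes an empty string should count as valid.

```java
final class CodePointCheck {
    // Hypothetical helper: true only if every code point is a letter or digit.
    static boolean isLettersAndDigits(String name) {
        for (int i = 0; i < name.length(); ) {
            int cp = name.codePointAt(i);          // reads a full code point, even a surrogate pair
            if (!Character.isLetterOrDigit(cp)) {
                return false;                      // stop at the first offending code point
            }
            i += Character.charCount(cp);          // advance by 1 or 2 chars
        }
        return true;
    }

    public static void main(String[] args) {
        System.out.println(isLettersAndDigits("M\u00FCller42"));
        // U+1D400 MATHEMATICAL BOLD CAPITAL A lies outside the BMP (two chars in a String)
        System.out.println(isLettersAndDigits("\uD835\uDC00"));
        System.out.println(isLettersAndDigits("a_b"));
    }
}
```

A `charAt`-based loop would see the two surrogate halves of U+1D400 separately, and neither half is a letter on its own.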

3 Answers

22

Since Java 7 you can use `Pattern.UNICODE_CHARACTER_CLASS`:

String s = "Müller";

Pattern p = Pattern.compile("^\\w+$", Pattern.UNICODE_CHARACTER_CLASS);
Matcher m = p.matcher(s);
if (m.find()) {
    System.out.println(m.group());
} else {
    System.out.println("not found");
}

Without the option, `\w+` will not match the word "Müller", but with `Pattern.UNICODE_CHARACTER_CLASS` it does. According to the documentation, the flag:

Enables the Unicode version of Predefined character classes and POSIX character classes.
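
To make the difference concrete, here is a small side-by-side sketch. The class and constant names are made up for illustration; only `Pattern.compile` and the flag itself come from the JDK.

```java
import java.util.regex.Pattern;

final class UnicodeFlagDemo {
    // Default behaviour: \w is [a-zA-Z_0-9]
    static final Pattern ASCII_WORD = Pattern.compile("\\w+");
    // Same pattern, compiled with the Unicode versions of the predefined classes
    static final Pattern UNICODE_WORD =
            Pattern.compile("\\w+", Pattern.UNICODE_CHARACTER_CLASS);

    public static void main(String[] args) {
        String s = "M\u00FCller"; // "Müller", escaped to stay independent of source encoding
        System.out.println(ASCII_WORD.matcher(s).matches());   // false
        System.out.println(UNICODE_WORD.matcher(s).matches()); // true
    }
}
```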

See here for more details

You can also have a look here for more Unicode information in Java 7.

And here, on regular-expressions.info, is an overview of the Unicode scripts, properties and blocks.

See here a famous answer from tchrist about the caveats of regex in Java, including an update on what has changed with Java 7 (or will change in Java 8).

stema
  • Of course this will also match underscores and other connecting punctuation. – Tim Pietzcker Feb 29 '12 at 14:08
  • @TimPietzcker That's true. If that matters, then your answer would be the better choice for the OP (+1 for you). – stema Feb 29 '12 at 14:16
  • @TimPietzcker Under `UNICODE_CHARACTER_CLASS`, the so-called POSIX classes also match per [UTS#18 Annex C](http://unicode.org/reports/tr18/#Compatibility_Properties); that is, `\p{alpha}` becomes (if and only if compiled under that `Pattern` compilation flag) exactly equal to the Unicode `Alphabetic=True` property, which is itself somewhat complicated but quite useful, and which does not include connector punctuation. Sorry for the run-on sentence. :) – tchrist Feb 29 '12 at 14:51
  • Just to add to this answer, the Unicode character class mode can also be enabled via the embedded flag expression `(?U)`, as mentioned in the [Pattern class documentation](http://docs.oracle.com/javase/7/docs/api/java/util/regex/Pattern.html#UNICODE_CHARACTER_CLASS). – Paweł Dyda Feb 29 '12 at 19:29
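
Two points from the comments above, the underscore caveat and the embedded flag, can be tried out together. This is only a sketch with hypothetical names; it assumes Java 7 or later and uses Java's spelling of the POSIX classes, `\p{Alpha}` and `\p{Digit}`.

```java
import java.util.regex.Pattern;

final class EmbeddedFlagDemo {
    // (?U) is the embedded form of Pattern.UNICODE_CHARACTER_CLASS
    static final Pattern WORD = Pattern.compile("(?U)\\w+");
    // Under the flag, \p{Alpha} follows the Unicode Alphabetic property and
    // \p{Digit} matches decimal digits; neither includes connector punctuation
    static final Pattern ALNUM = Pattern.compile("(?U)[\\p{Alpha}\\p{Digit}]+");

    public static void main(String[] args) {
        String withUnderscore = "M\u00FCller_1";
        System.out.println(WORD.matcher(withUnderscore).matches());  // true: \w accepts '_'
        System.out.println(ALNUM.matcher(withUnderscore).matches()); // false: '_' is rejected
        System.out.println(ALNUM.matcher("M\u00FCller1").matches()); // true
    }
}
```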
9
boolean foundMatch = name.matches("[\\p{L}\\p{Nd}]*");

should work.

`[\p{L}\p{Nd}]` matches a character that is either a Unicode letter or a decimal digit. The `String.matches()` method ensures that the entire string matches the pattern.
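
A short sketch of how this pattern behaves, with hypothetical method names. The `+` variant is an addition for comparison, on the assumption that a validator may want to reject empty input; the `*` version above accepts it.

```java
final class LetterDigitRegex {
    // The pattern from the answer: zero or more letters or decimal digits
    static boolean allowEmpty(String name) {
        return name.matches("[\\p{L}\\p{Nd}]*");
    }

    // Assumed variant: at least one character is required
    static boolean requireNonEmpty(String name) {
        return name.matches("[\\p{L}\\p{Nd}]+");
    }

    public static void main(String[] args) {
        System.out.println(allowEmpty("M\u00FCller42"));  // letters and ASCII digits
        System.out.println(allowEmpty("\u0967"));         // U+0967 DEVANAGARI DIGIT ONE
        System.out.println(allowEmpty("a b"));            // contains a space
        System.out.println(allowEmpty(""));
        System.out.println(requireNonEmpty(""));
    }
}
```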

Tim Pietzcker
  • Other possible Unicode categories (e.g. `L` or `N`) can be found [here](http://www.fileformat.info/info/unicode/category/index.htm). – beerbajay Feb 29 '12 at 13:45
  • You don’t need the braces for the 7 major categories. You might also like `\pM`, so `[\pL\pM\pN]`. Note that that is already a broader definition than `\p{Alphabetic}`, because it includes all marks, not just some of them. That puts it closer to the `\p{word}` property used for program identifiers, which per [UTS#18 Annex C](http://unicode.org/reports/tr18/#Compatibility_Properties) is `[\p{alpha}\p{gc=Mark}\p{gc=Digit}\p{gc=Pc}]`, where `\p{alpha}` is complicated, but basically picks only a few of the marks. – tchrist Feb 29 '12 at 15:50
  • @TimPietzcker Hold on: your boolean test is wrong. All possible strings match zero or more repetitions of anything. I don’t think you want that star. Also, as commented elsewhere, although it’s probably what you want, `\pN` is more than just digits; `\p{Nd}` is just decimal digits without Roman numerals, vulgar fractions, or sub- and superscripts, etc. Just call `\pN` any numeric, not any digit, and you’ll be right. – tchrist Feb 29 '12 at 15:51
  • @tchrist: The `matches()` method requires the regex to match the entire input string, not just a substring. So it only matches if the entire string is composed of letters/digits (or is empty, which arguably fulfills that definition too). Good point about `\p{Nd}`. – Tim Pietzcker Feb 29 '12 at 17:58
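
The category distinctions discussed in these comments can be checked with a few sample characters. The sketch below uses hypothetical names and assumes the braced forms `\p{L}`, `\p{M}`, `\p{N}` and `\p{Nd}`.

```java
final class CategoryDemo {
    // Letters and decimal digits only, as in the answer (with + instead of *)
    static boolean lettersAndDecimalDigits(String s) {
        return s.matches("[\\p{L}\\p{Nd}]+");
    }

    // Broader class from the comments: letters, all marks, any numeric
    static boolean lettersMarksNumerics(String s) {
        return s.matches("[\\p{L}\\p{M}\\p{N}]+");
    }

    public static void main(String[] args) {
        String decomposed = "e\u0301";  // 'e' followed by U+0301 COMBINING ACUTE ACCENT (a mark)
        String roman = "\u2163";        // U+2163 ROMAN NUMERAL FOUR (letter number, not Nd)
        String superscript = "\u00B2";  // U+00B2 SUPERSCRIPT TWO (other number, not Nd)

        for (String s : new String[] {decomposed, roman, superscript}) {
            System.out.println(lettersAndDecimalDigits(s) + " " + lettersMarksNumerics(s));
        }
    }
}
```

Whether combining marks and non-decimal numerics should be accepted depends on the input field; the narrower class rejects decomposed accented letters unless the input is normalized first.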
1

Some people, when confronted with a problem, think "I know, I'll use regular expressions." Now they have two problems.

-- Jamie Zawinski

I say this in jest, but iterating through the String like you are doing will have runtime performance at least as good as any regex — there's no way a regex can do what you want any faster; and you don't have the overhead of compiling a pattern in the first place.

So as long as:

  • the validation doesn't need to do anything else regex-like (nothing was mentioned in the question)
  • the intention of the code looping through the String is clear (and if not, refactor until it is)

then why replace it with a regex just because you can?
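
One way to keep the intention of the loop obvious is to give it a name and let the stream API do the iteration. This is only a sketch: the names are hypothetical, and `String.codePoints()` assumes Java 8 or later, which is newer than the Java 7 discussed elsewhere on this page.

```java
final class LoopValidator {
    // Hypothetical helper: true if every code point is a letter or digit.
    // allMatch short-circuits on the first code point that fails the test.
    static boolean containsOnlyLettersAndDigits(String name) {
        return name.codePoints().allMatch(Character::isLetterOrDigit);
    }

    public static void main(String[] args) {
        System.out.println(containsOnlyLettersAndDigits("M\u00FCller42"));
        System.out.println(containsOnlyLettersAndDigits("M\u00FCller 42"));
    }
}
```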

Sean Reilly
  • It would be interesting to back up this claim by measurements. – Tim Pietzcker Feb 29 '12 at 14:18
  • +1 Well, you can agree or disagree, but it's really an interesting link! – ManuPK Feb 29 '12 at 14:22
  • @Tim: you don't even really need measurements. Unless you're using quantum computing, you can't verify that all characters in a list of characters (aka a String) are letters or digits without visiting each character, and stopping as soon as you find one that isn't. Since this is what the custom code does, it's the minimum possible amount of work. Regexes aren't magic. – Sean Reilly Feb 29 '12 at 14:44
  • Regexes get things right more often than handcoding does. For example, would you have remembered to use `codePointAt` instead of the erroneous `charAt` that the OP used? The regex would have already taken care of that for you. Handrolled code can be as tight as a regex, but usually isn’t. It depends on how much time you want to put into crafting it versus how much time the fellow who did the regex library put into doing so. A regex can replace pages of complicated, error-prone code. Always use the regex first, then optimize later only if profiling proves this is needed. Programmer time wins. – tchrist Feb 29 '12 at 15:30
  • @tchrist: `Always use the regex first, then optimize later only if profiling proves this is needed`. `Programmer time wins`. Those two statements often contradict each other, and an overly complicated regex is frequently present when they do. I agree completely with the second statement, but not necessarily with the first. If we change the word "regex" to "straightforward solution" (a regex is often, but not always, the straightforward solution, especially in Java), then I'd largely agree with you. – Sean Reilly Feb 29 '12 at 15:39
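
On the pattern-compilation overhead mentioned in this answer: `String.matches` compiles its regex on every call, so if a regex is chosen, that cost can be paid once by keeping a compiled `Pattern` around. The sketch below uses hypothetical names and makes no claim about which approach is faster overall; that would need measuring.

```java
import java.util.regex.Pattern;

final class PrecompiledValidator {
    // Compiled once when the class is loaded and reused for every call
    private static final Pattern LETTERS_AND_DIGITS =
            Pattern.compile("[\\p{L}\\p{Nd}]+");

    // Hypothetical helper: true if the whole string is letters and decimal digits
    static boolean isValid(String name) {
        return LETTERS_AND_DIGITS.matcher(name).matches();
    }

    public static void main(String[] args) {
        System.out.println(isValid("M\u00FCller42"));
        System.out.println(isValid("M\u00FCller-42"));
    }
}
```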