2

This is my current regex code to validate English letters and numbers:

const CANONICAL_FMT = '[0-9a-z]{1,64}';

public static function isCanonical($str)
{
    return preg_match('/^(?:' . self::CANONICAL_FMT . ')$/', $str);

}

Pretty straightforward. Now I want to change it to validate only Hebrew, underscores, and numbers, so I changed the code to:

public static function isCanonical($str)
{
    return preg_match('/^(?:[\u0590-\u05FF\uFB1D-\uFB40]+|[\w]+)$/i', $str);

} 

But it doesn't work. I basically took the Hebrew Unicode range from Wikipedia. What is wrong here?

Tom

3 Answers

3

I was able to get it to work much more easily, using the /u flag and the \p{Hebrew} Unicode character property:

return preg_match('/^(?:\p{Hebrew}+|\w+)$/iu', $str);

Working example: http://ideone.com/gSlmh
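Note that `\w` still accepts English letters, while the question asks for Hebrew, underscores, and numbers only. A minimal sketch of a stricter check follows. The function name `isHebrewCanonical` is hypothetical, and the `{1,64}` length limit is carried over from the original `CANONICAL_FMT` as an assumption. The source file and the input are assumed to be UTF-8, and `\p{Hebrew}` covers the whole Hebrew script, which includes vowel points as well as letters.

```php
<?php
// Hypothetical helper: the name is illustrative, not from the question.
function isHebrewCanonical($str)
{
    // \p{Hebrew} matches any character in the Hebrew script.
    // 0-9 and _ are listed explicitly, so English letters are rejected.
    // The u modifier makes PCRE treat pattern and subject as UTF-8.
    return preg_match('/^[\p{Hebrew}0-9_]{1,64}$/u', $str) === 1;
}

var_dump(isHebrewCanonical('שלום_123')); // bool(true)
var_dump(isHebrewCanonical('hello'));    // bool(false)
```

If English letters should be accepted as well, the `\p{Hebrew}+|\w+` pattern above is the one to keep.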

Kobi
  • Kobi, what does the ?: at the beginning stand for? – Tom Jul 24 '11 at 13:18
  • @Tom - nothing special - it is a [non-capturing group](http://stackoverflow.com/questions/3512471/non-capturing-group). I just copied it from the question `:)` – Kobi Jul 24 '11 at 16:27
1

If you want preg_match() to work properly with UTF-8, you might have to enable the u modifier (quoting):

This modifier turns on additional functionality of PCRE that is incompatible with Perl. Pattern strings are treated as UTF-8.


In your case, instead of using the following regex:

/^(?:[\u0590-\u05FF\uFB1D-\uFB40]+|[\w]+)$/i

I suppose you'd be using:

/^(?:[\u0590-\u05FF\uFB1D-\uFB40]+|[\w]+)$/iu

(Note the additional u at the end.)

Pascal MARTIN
  • If it didn't work, can you provide what version of PHP and PCRE you're using? That information is in phpinfo(), and I ask because PCRE has seen significant improvements in newer PHP versions. – Eric Caron Jul 22 '11 at 23:02
1

You need the /u modifier to add support for UTF-8.

Make sure you convert your Hebrew input to UTF-8 if it's in some other code page or character set.
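As a rough sketch of that conversion, assuming the incoming bytes are ISO-8859-8 (the thread never says what the actual source encoding is) and that the mbstring extension is available:

```php
<?php
// Assumption: the input arrives as ISO-8859-8. Replace this with the
// real source character set of your data.
$raw = "\xF9\xEC\xE5\xED"; // the word "שלום" encoded as ISO-8859-8

// Convert to UTF-8 before handing the string to preg_match() with /u.
$utf8 = mb_convert_encoding($raw, 'UTF-8', 'ISO-8859-8');

// mb_check_encoding() confirms the result is valid UTF-8.
var_dump(mb_check_encoding($utf8, 'UTF-8'));          // bool(true)
var_dump(preg_match('/^\p{Hebrew}+$/u', $utf8) === 1); // bool(true)
```

If the page and the form submission are already UTF-8, no conversion is needed.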

Ariel
  • I tried: $str = utf8_encode($str); and then: /^(?:[\u0590-\u05FF\uFB1D-\uFB40]+|[\w]+)$/iu without any luck. – Tom Jul 22 '11 at 21:48
  • @Tom `utf8_encode()` encodes an ISO-8859-1 string, but that is not a Hebrew character set. What is the incoming character set? Try `mb_convert_encoding()` with the proper character set. Is this data coming from a webpage? You would make your life a lot easier if you used UTF-8 in the webpage, because then you wouldn't need conversions. Also, `\u0590` is not legal in preg. You need `\x{0590}`. – Ariel Jul 22 '11 at 21:53
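For illustration, here is the pattern from the question with each `\uXXXX` escape rewritten in the `\x{XXXX}` form. This is a sketch that assumes UTF-8 input and keeps the question's ranges unchanged. The pattern is written in a single-quoted PHP string so the backslashes reach PCRE untouched.

```php
<?php
// The question's pattern, with \uXXXX replaced by PCRE's \x{XXXX} syntax.
// The u modifier is required for code points above 0xFF.
$pattern = '/^(?:[\x{0590}-\x{05FF}\x{FB1D}-\x{FB40}]+|\w+)$/iu';

var_dump(preg_match($pattern, 'שלום'));  // int(1)
var_dump(preg_match($pattern, 'שלום!')); // int(0)
```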