Hebrew regex match not working in php

Question

This is my current regex code to validate English letters and numbers:

const CANONICAL_FMT = '[0-9a-z]{1,64}';

public static function isCanonical($str)
{
    return preg_match('/^(?:' . self::CANONICAL_FMT . ')$/', $str);

}

Pretty straightforward. Now I want to change that to validate only Hebrew, underscores and numbers. So I changed the code to:

public static function isCanonical($str)
{
    return preg_match('/^(?:[\u0590-\u05FF\uFB1D-\uFB40]+|[\w]+)$/i', $str);

}

But it doesn't work. I basically took the Hebrew Unicode range from Wikipedia. What is wrong here?

score 3 · Accepted Answer · answered Jul 23 '11 at 09:41


I was able to get it to work much more easily, using the /u flag and the \p{Hebrew} Unicode character property:

return preg_match('/^(?:\p{Hebrew}+|\w+)$/iu', $str);

Working example: http://ideone.com/gSlmh
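
Note that `\w` in the pattern above still lets ASCII letters through. If only Hebrew letters, digits and underscores should pass, as the question asks, a stricter variant might look like the sketch below. The function name `isHebrewCanonical` is hypothetical, and the 1 to 64 length limit is an assumption carried over from the original `CANONICAL_FMT` constant.

```php
<?php
// Sketch only: accept nothing but Hebrew-script characters, ASCII digits
// and underscores. The function name is hypothetical.
function isHebrewCanonical(string $str): bool
{
    // \p{Hebrew} matches characters of the Hebrew script. The u modifier
    // makes PCRE treat the pattern and the subject as UTF-8.
    // preg_match() returns 1 on a match, 0 on no match, false on error.
    return preg_match('/^[\p{Hebrew}0-9_]{1,64}$/u', $str) === 1;
}

var_dump(isHebrewCanonical('שלום_123')); // bool(true)
var_dump(isHebrewCanonical('shalom'));   // bool(false)
```

The input must already be valid UTF-8 for this to work; otherwise `preg_match()` fails instead of matching.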

Kobi

Kobi, what does the ?: at the beginning stand for? – Tom Jul 24 '11 at 13:18
@Tom - nothing special - it is a [non-capturing group](http://stackoverflow.com/questions/3512471/non-capturing-group). I just copied it from the question `:)` – Kobi Jul 24 '11 at 16:27

score 1 · Answer 2 · answered Jul 22 '11 at 21:32


If you want preg_match() to work properly with UTF-8, you might have to enable the u modifier (quoting the PHP manual):

This modifier turns on additional functionality of PCRE that is incompatible with Perl. Pattern strings are treated as UTF-8.

In your case, instead of using the following regex:

/^(?:[\u0590-\u05FF\uFB1D-\uFB40]+|[\w]+)$/i

I suppose you'd be using:

/^(?:[\u0590-\u05FF\uFB1D-\uFB40]+|[\w]+)$/iu

(Note the additional u at the end)
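
As a comment elsewhere in this thread points out, `\u0590` is not a legal escape in PHP's preg functions, so the u modifier alone is probably not enough; code points need the `\x{0590}` form. A minimal sketch under that assumption, using the same ranges as the question, restricted to Hebrew, digits and underscores (the function name `isHebrewByRange` is hypothetical):

```php
<?php
// Sketch only: the question's ranges rewritten with the \x{HHHH} escape
// that PCRE understands. The function name is hypothetical.
function isHebrewByRange(string $str): bool
{
    // Single quotes keep the backslashes intact so PCRE sees \x{0590} etc.
    // The u modifier is required for code points above 0xFF.
    $pattern = '/^[\x{0590}-\x{05FF}\x{FB1D}-\x{FB40}0-9_]+$/u';

    return preg_match($pattern, $str) === 1;
}

var_dump(isHebrewByRange('שלום_42')); // bool(true)
var_dump(isHebrewByRange('abc'));     // bool(false)
```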

Pascal MARTIN

If it didn't work, can you provide what version of PHP and PCRE you're using? That information is in phpinfo(), and I ask because PCRE has seen significant improvements in newer PHP versions. – Eric Caron Jul 22 '11 at 23:02

score 1 · Answer 3 · answered Jul 22 '11 at 21:33


You need the /u modifier to add support for UTF-8.

Make sure you convert your Hebrew input to UTF-8 if it's in some other codepage/character set.
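
A rough sketch of such a conversion, assuming the mbstring extension is available and assuming the incoming bytes are ISO-8859-8 (the real source encoding has to be checked; it is only a guess here). The function name `toUtf8Hebrew` is hypothetical.

```php
<?php
// Sketch only: convert legacy-encoded Hebrew bytes to UTF-8 before matching.
// The default source encoding is an assumption; the name is hypothetical.
function toUtf8Hebrew(string $raw, string $fromEncoding = 'ISO-8859-8'): string
{
    // mb_convert_encoding(string, to_encoding, from_encoding)
    return mb_convert_encoding($raw, 'UTF-8', $fromEncoding);
}

// 0xE0 is the letter alef in ISO-8859-8.
$utf8 = toUtf8Hebrew("\xE0");
var_dump(preg_match('/^\p{Hebrew}+$/u', $utf8)); // int(1)
```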

Ariel

I tried: $str = utf8_encode($str); and then: /^(?:[\u0590-\u05FF\uFB1D-\uFB40]+|[\w]+)$/iu without any luck. – Tom Jul 22 '11 at 21:48

@Tom `utf8_encode()` encodes an ISO-8859-1 string, but that is not a Hebrew character set. What is the incoming character set? Try `mb_convert_encoding()` with the proper character set. Is this data coming from a webpage? Because you would make your life a lot easier if you used UTF-8 in the webpage; then you wouldn't need conversions. Also, `\u0590` is not legal in preg. You need `\x{0590}`. – Ariel Jul 22 '11 at 21:53
