2

we have this code:

$value = preg_replace("/[^\w]/", '', $value);

where $value is in UTF-8. After this transformation, the first byte of each multibyte character is stripped. How can I make \w cover UTF-8 characters completely?

Sorry, I am not very experienced with PHP.

hakre
Andrey

5 Answers

6

You could try the /u modifier. From the PHP manual:

This modifier turns on additional functionality of PCRE that is incompatible with Perl. Pattern strings are treated as UTF-8. This modifier is available from PHP 4.1.0 or greater on Unix and from PHP 4.2.3 on win32. UTF-8 validity of the pattern is checked since PHP 4.3.5.
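As a minimal sketch of what the modifier changes, the snippet below applies the question's pattern with `u` appended to a sample string. The sample string is made up for illustration, and the result assumes a PHP build whose PCRE has Unicode property support, so that `\w` also matches non-ASCII letters.

```php
<?php
// "caf\xC3\xA9" is "café" encoded as UTF-8 (é = bytes 0xC3 0xA9).
// The sample string is an assumption for illustration only.
$value = "caf\xC3\xA9 au lait!";

// With the u modifier, pattern and subject are both read as UTF-8,
// so é is treated as one character rather than two separate bytes.
$clean = preg_replace('/[^\w]/u', '', $value);

echo $clean, PHP_EOL;
```

Without the modifier the result depends on the locale tables PCRE uses for bytes above 127, which is why individual bytes of a multibyte character can be stripped.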

If that won't do, try

instead.

Gordon
4

There is this nasty u modifier for PCRE patterns in PHP. The documentation states that the pattern is treated as UTF-8, but I found that it treats the input as UTF-8, too.
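One way to observe that the subject is also read as UTF-8 is to pass a string that is not valid UTF-8. This is a small sketch with a made-up ISO-8859-1 input; it only uses `preg_replace()` and `preg_last_error()`.

```php
<?php
// "café" in ISO-8859-1: the lone byte 0xE9 is not a valid UTF-8 sequence.
// The input is an assumption for illustration only.
$latin1 = "caf\xE9";

// In u mode the subject must be valid UTF-8, so the call fails
// and returns NULL instead of a string.
$result = preg_replace('/[^\w]/u', '', $latin1);
$error  = preg_last_error();

var_dump($result);
var_dump($error === PREG_BAD_UTF8_ERROR);
```

Checking for a `null` return value (or calling `preg_last_error()`) is therefore worthwhile whenever the input encoding is not guaranteed.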

soulmerge
  • This answer has been added to the [Stack Overflow Regular Expression FAQ](http://stackoverflow.com/a/22944075/2736496), under "Modifiers". – aliteralmind Apr 10 '14 at 00:44
2

Append u to the regex to turn on PCRE's UTF-8 (Unicode) mode:

$value = preg_replace("/[^\w]/u", '', $value);

Corollary

In Unicode mode, PCRE expects the subject string to be valid UTF-8; if it is not, preg_replace() fails and returns NULL. Therefore, to convert anything to UTF-8 (and drop any unconvertible junk), we first use:

$value = iconv( 'ISO-8859-1', 'UTF-8//IGNORE//TRANSLIT', $value );

to clean and prep the input.

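Because any byte sequence can be read as ISO-8859-1 (even if some obscure characters then appear incorrectly), and since browsers often fall back to an 8859-style single-byte encoding unless told to use UTF-8, we've found this call to be a general, safe, effective way to 'take anything, drop any junk, and convert into UTF-8'. Note that input which is already UTF-8 gets double-encoded by this call, so apply it only to input that is not already valid UTF-8.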
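A hedged variant of this preparation step is sketched below: it converts only when the input is not already valid UTF-8. The helper name `ensure_utf8` is hypothetical, the sketch assumes the mbstring and iconv extensions are loaded and PHP 7 or later (for the type declarations), and it assumes any non-UTF-8 input really is ISO-8859-1.

```php
<?php
// Hypothetical helper: return $input as UTF-8, converting from
// ISO-8859-1 only if the bytes are not already valid UTF-8.
function ensure_utf8(string $input): string
{
    if (mb_check_encoding($input, 'UTF-8')) {
        return $input; // already valid UTF-8, leave untouched
    }

    $converted = iconv('ISO-8859-1', 'UTF-8', $input);
    if ($converted === false) {
        throw new RuntimeException('Conversion to UTF-8 failed');
    }

    return $converted;
}

// Usage: prepare the input, then apply the Unicode-aware pattern.
$value = ensure_utf8("caf\xE9 au lait!");
$value = preg_replace('/[^\w]/u', '', $value);
```

The validity check is there so that text which is already UTF-8 passes through unchanged instead of being re-encoded.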

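The POSIX ereg_* functions are deprecated as of PHP 5.3.0 (the mb_ereg_* functions from mbstring are a separate implementation and are not), so the preg_* functions with the u modifier are the more conventional way to go.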

Anthony
FYA
1

Try this function instead: http://php.net/manual/en/function.mb-ereg-replace.php

  • 5
    I would rather advise *not* to use `mb_ereg_replace`. It is built on the deprecated `ereg_replace`. See http://php.net/ereg_replace – soulmerge Mar 31 '10 at 14:10
0

Use [^\w]+ instead of [^\w], so that a whole run of non-word characters is removed in one match.

You can also use \W in place of [^\w]
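Combined with the u modifier, the shorter forms could look like the sketch below. The sample string is made up, and the explicit `\p{L}\p{N}_` class is shown only as a roughly comparable spelling of `\w` in Unicode mode, assuming PCRE was built with Unicode property support.

```php
<?php
// "naïve_café #1" encoded as UTF-8; sample input for illustration only.
$value = "na\xC3\xAFve_caf\xC3\xA9 #1";

// \W is the negation of \w, and + removes each run in one match.
$viaShorthand = preg_replace('/\W+/u', '', $value);

// Spelling the class out with Unicode properties: letters, numbers, underscore.
$viaProperties = preg_replace('/[^\p{L}\p{N}_]+/u', '', $value);

echo $viaShorthand, PHP_EOL;
echo $viaProperties, PHP_EOL;
```

The explicit property class can be handy when the set of characters to keep needs adjusting, for example to allow hyphens.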

codaddict