Strip out all strange characters from a string

Question

I'm trying to create a slug, so I would like to strip out every strange character. The slug should contain only lowercase letters and underscores. Is there a way to check for strange characters and filter them out of the string? Everything that is not a letter or an underscore should be deleted.

This is what I have:

if(!preg_match_all('/[a-z]/')):
    $output = preg_replace("/ ... euhm ... /", "", $slug2);
else:
    $output = $slug2;
endif;

I should go from this: Create a 3D Ribbon Wrap-Around Effect (Plus a Free PSD!)

to this: create_a_3d_ribbon_wrap_around_effect_plus_a_free_psd

Sometimes it is easier to "remove all characters but this, this, and this"; I think that fits your case. — BeemerGuy, Nov 17 '10 at 23:02
So you also want to translate spaces into underscores? And numbers are also ok? Your example and your description do not match up. — cdhowie, Nov 17 '10 at 23:02
There are 1908 lowercase letters, of which your `[a-z]` comprises merely a hairsbreadth more than 1⅓%. — tchrist, Nov 17 '10 at 23:48
This is a duplicate of possibly many other questions. Here are some: http://stackoverflow.com/questions/4051889/regular-expression-any-text-to-url-friendly-one, http://stackoverflow.com/questions/1432463/how-do-i-sanitize-title-uris, http://stackoverflow.com/questions/25259/how-do-you-include-a-webpage-title-as-part-of-a-webpage-url, http://stackoverflow.com/questions/3984983/php-code-to-generate-safe-url, etc. — Gumbo, Nov 18 '10 at 06:57
@cdhowie of course it does not match, that's why I'm posting the question. — Christophe, Nov 18 '10 at 08:27
@Gumbo ah ok, sorry about that, it was 1 am when I asked the question and I was a bit tired :) but I upvoted your answer, and I'll make sure to have a look next time — Christophe, Nov 18 '10 at 08:28
@everyone: I was already using strtolower but because that was not the problem I didn't post it in the example — Christophe, Nov 18 '10 at 08:29

score 3 · Accepted Answer · answered Nov 17 '10 at 23:04

$slug = strtolower($slug);
$slug = str_replace(" ", "_", $slug);
$slug = preg_replace("/[^a-z0-9_]/", "", $slug);
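
To see what these three calls do to the title from the question, here is a minimal sketch that wraps them in a function. The name `slug_strip` is made up for illustration and is not part of the answer.

```php
<?php
// Hypothetical helper; it only wraps the three calls shown above.
function slug_strip(string $title): string
{
    $slug = strtolower($title);                      // lowercase the ASCII letters
    $slug = str_replace(' ', '_', $slug);            // each space becomes one underscore
    return preg_replace('/[^a-z0-9_]/', '', $slug);  // delete everything else
}

echo slug_strip('Create a 3D Ribbon Wrap-Around Effect (Plus a Free PSD!)'), "\n";
// prints: create_a_3d_ribbon_wraparound_effect_plus_a_free_psd
```

Note that the hyphen is deleted rather than converted, so "Wrap-Around" comes out as "wraparound" instead of the "wrap_around" shown in the question.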

cdhowie


I’m afraid you’ve forgotten one thousand eight hundred and eighty-two lowercase letters besides those quaint 1960sish *a-z*. `unichars -a '\p{Lower}' '[^a-z]' | wc -l` == 1882 – tchrist Nov 17 '10 at 23:43
Based on the OP's example, this looks like it will be used for slugs in a URL. And Unicode characters look *ugly* in URLs, so I doubt the OP wants to keep them. – cdhowie Nov 17 '10 at 23:45
Agreed. But interesting fact that there are more than latin, cyrillic and greek. ;) – AndreKR Nov 17 '10 at 23:55
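
If keeping non-ASCII lowercase letters were wanted after all, one possible sketch is below. It assumes UTF-8 input, uses the PCRE `u` modifier with `\p{Ll}` (the Unicode "lowercase letter" category, which is close to but not the same as the `\p{Lower}` property mentioned above), and falls back to `strtolower` when the mbstring extension is missing. The function name `slug_unicode_sketch` is hypothetical.

```php
<?php
// Hypothetical variant; assumes $title is valid UTF-8.
function slug_unicode_sketch(string $title): string
{
    // mb_strtolower handles non-ASCII letters; strtolower is only an ASCII fallback.
    $slug = function_exists('mb_strtolower')
        ? mb_strtolower($title, 'UTF-8')
        : strtolower($title);
    $slug = str_replace(' ', '_', $slug);
    // Keep Unicode lowercase letters, decimal digits and underscores.
    $slug = preg_replace('/[^\p{Ll}\p{Nd}_]/u', '', $slug);
    return $slug ?? '';  // preg_replace returns null on invalid UTF-8
}

echo slug_unicode_sketch("Cr\u{00E8}me Br\u{00FB}l\u{00E9}e!"), "\n";
// prints: crème_brûlée
```

Letters from scripts without case, which fall outside `\p{Ll}`, would still be stripped by this pattern.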

John Kugelman · Answer 2 · 2010-11-17T23:39:09.433

No need for the initial match. You can do an unconditional search-and-replace. If there's nothing to replace, no big deal. Here it is as one big chain of function calls:

$slug = trim(preg_replace('/[\W_]+/', '_', strtolower($slug)), '_');

Or split out into separate lines:

$slug = strtolower($slug);
$slug = preg_replace('/[\W_]+/', '_', $slug);
$slug = trim($slug, '_');

Explanation:

Convert uppercase to lowercase with strtolower.
Search for \W and _. A "word" character is a letter, digit, or underscore. A "non-word" character is the opposite of that, i.e. whitespace, punctuation, and control characters. \W matches "non-word" characters.
Replace those bad characters with underscores. If there's more than one in a row they'll all get replaced by a single underscore.
Trim underscores from the beginning and end of the string.

The code's on the complicated side because there are several tricky cases it needs to handle:

Bad characters on the ends need to be deleted, not converted to underscores. For example, the !) in your example.
We want foo_-_bar to turn into foo_bar, not foo___bar. Underscores should be collapsed, basically.
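
As a quick check of the two tricky cases, here is a small sketch that runs the same three steps on the title from the question and on foo_-_bar. The function name `slug_collapse` is only for illustration.

```php
<?php
// Hypothetical wrapper around the lowercase / replace / trim steps above.
function slug_collapse(string $title): string
{
    $slug = strtolower($title);                    // uppercase to lowercase
    $slug = preg_replace('/[\W_]+/', '_', $slug);  // each run of bad characters becomes one "_"
    return trim($slug, '_');                       // drop leading and trailing underscores
}

echo slug_collapse('Create a 3D Ribbon Wrap-Around Effect (Plus a Free PSD!)'), "\n";
// prints: create_a_3d_ribbon_wrap_around_effect_plus_a_free_psd
echo slug_collapse('foo_-_bar'), "\n";
// prints: foo_bar
```

The trailing !) first turns into a single underscore, and the final trim removes it.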

score 0 · Answer 3 · answered Nov 17 '10 at 23:02

$slug = preg_replace("/[^a-z_]/", "", $slug);

AndreKR


You forgot almost two thousand lowercase letters. – tchrist Nov 17 '10 at 23:44
