
I'm trying to create a slug, so I would like to strip out every strange character. The only things the slug should contain are lowercase letters and underscores. Is there a way to check for strange characters and filter them out of the string? Everything that is not a letter or an underscore should be deleted.

This is what I have:

if(!preg_match_all('/[a-z]/')):
    $output = preg_replace("/ ... euhm ... /", "", $slug2);
else:
    $output = $slug2;
endif;

I should go from this: Create a 3D Ribbon Wrap-Around Effect (Plus a Free PSD!)

to this: create_a_3d_ribbon_wrap_around_effect_plus_a_free_psd

Cœur
Christophe
  • Sometimes it is easier to "remove all characters but this, this, and this"; I think that fits your case. – BeemerGuy Nov 17 '10 at 23:02
  • So you also want to translate spaces into underscores? And numbers are also ok? Your example and your description do not match up. – cdhowie Nov 17 '10 at 23:02
  • There are 1908 lowercase letters, of which your `[a-z]` comprises merely a hairsbreadth more than 1⅓%. – tchrist Nov 17 '10 at 23:48
  • This is a duplicate of possibly many other questions. Here are some: http://stackoverflow.com/questions/4051889/regular-expression-any-text-to-url-friendly-one, http://stackoverflow.com/questions/1432463/how-do-i-sanitize-title-uris, http://stackoverflow.com/questions/25259/how-do-you-include-a-webpage-title-as-part-of-a-webpage-url, http://stackoverflow.com/questions/3984983/php-code-to-generate-safe-url, etc. – Gumbo Nov 18 '10 at 06:57
  • @cdhowie of course it does not match, that's why I'm posting the question. – Christophe Nov 18 '10 at 08:27
  • @Gumbo ah ok, sorry about that, it was 1 am when I asked the question and I was a bit tired :) But I upvoted your answer, and I'll make sure to have a look next time – Christophe Nov 18 '10 at 08:28
  • @everyone: I was already using strtolower but because that was not the problem I didn't post it in the example – Christophe Nov 18 '10 at 08:29

3 Answers

$slug = strtolower($slug);
$slug = str_replace(" ", "_", $slug);
$slug = preg_replace("/[^a-z0-9_]/", "", $slug);
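A quick check of these three steps against the title from the question, as a sketch. The wrapper name `simple_slug` is hypothetical and only there so the steps can be called on a sample string. Note that the hyphen in "Wrap-Around" is deleted rather than turned into an underscore, so the result differs slightly from the slug the question asks for.

```php
<?php
// Hypothetical wrapper (name chosen for illustration) around the three steps above.
function simple_slug(string $text): string
{
    $slug = strtolower($text);                       // uppercase -> lowercase
    $slug = str_replace(" ", "_", $slug);            // spaces -> underscores
    return preg_replace("/[^a-z0-9_]/", "", $slug);  // delete everything else
}

echo simple_slug("Create a 3D Ribbon Wrap-Around Effect (Plus a Free PSD!)"), "\n";
// prints: create_a_3d_ribbon_wraparound_effect_plus_a_free_psd
```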
cdhowie
  • I’m afraid you’ve forgotten one thousand eight hundred and eighty-two lowercase letters besides those quaint 1960sish *a-z*. `unichars -a '\p{Lower}' '[^a-z]' | wc -l` == 1882 – tchrist Nov 17 '10 at 23:43
  • Based on the OP's example, this looks like it will be used for slugs in a URL. And Unicode characters look *ugly* in URLs, so I doubt the OP wants to keep them. – cdhowie Nov 17 '10 at 23:45
  • Agreed. But interesting fact that there are more than latin, cyrillic and greek. ;) – AndreKR Nov 17 '10 at 23:55

No need for the initial match. You can do an unconditional search-and-replace. If there's nothing to replace, no big deal. Here it is as one big chain of function calls:

$slug = trim(preg_replace('/[\W_]+/', '_', strtolower($slug)), '_');

Or split out into separate lines:

$slug = strtolower($slug);
$slug = preg_replace('/[\W_]+/', '_', $slug);
$slug = trim($slug, '_');

Explanation:

  1. Convert uppercase to lowercase with strtolower.
  2. Search for \W and _. A "word" character is a letter, digit, or underscore. A "non-word" character is the opposite of that, i.e. whitespace, punctuation, and control characters. \W matches "non-word" characters.
  3. Replace those bad characters with underscores. If there's more than one in a row they'll all get replaced by a single underscore.
  4. Trim underscores from the beginning and end of the string.

The code's on the complicated side because there are several tricky cases it needs to handle:

  • Bad characters on the ends need to be deleted, not converted to underscores. For example, the !) in your example.
  • We want foo_-_bar to turn into foo_bar, not foo___bar. Underscores should be collapsed, basically.
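The steps and edge cases above can be sketched as a small function. The name `slugify` is an assumption for illustration; the body is just the three lines from this answer. Because `\W` treats digits as word characters, numbers such as "3D" are kept.

```php
<?php
// Hypothetical helper name; the body follows the three steps described above.
function slugify(string $text): string
{
    $slug = strtolower($text);                     // 1. lowercase
    $slug = preg_replace('/[\W_]+/', '_', $slug);  // 2-3. collapse each run of non-word chars/underscores into one "_"
    return trim($slug, '_');                       // 4. drop leading/trailing underscores
}

echo slugify("Create a 3D Ribbon Wrap-Around Effect (Plus a Free PSD!)"), "\n";
// prints: create_a_3d_ribbon_wrap_around_effect_plus_a_free_psd

echo slugify("foo_-_bar"), "\n";
// prints: foo_bar
```

An input made only of punctuation, such as "!!!", comes back as an empty string, so a caller may want a fallback for that case.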
John Kugelman
// The pattern needs delimiters (here, slashes) around the character class.
$slug = preg_replace("/[^a-z_]/", "", $slug);
AndreKR