24

I would like to sanitize a string into a URL, so this is basically what I need:

  1. Everything must be removed except alphanumeric characters, spaces and dashes.
  2. Spaces should be converted into dashes.

E.g.

This, is the URL!

must return

this-is-the-url
mickmackusa
Atif
  • Hi jens, I am clueless about the code and that's what I need help with. The only thing I know is that it should use preg_replace(), but I don't know what the regular expression should be. Thanks – Atif Jun 11 '10 at 11:09

9 Answers

50
function slug($z){
    $z = strtolower($z);
    $z = preg_replace('/[^a-z0-9 -]+/', '', $z);
    $z = str_replace(' ', '-', $z);
    return trim($z, '-');
}
SilentGhost
  • Great, thanks. Just one edit: I want to remove dashes from the beginning and end before returning $z, just in case they exist. – Atif Jun 11 '10 at 11:19
  • -1: Reading between the lines of what SilentGhost *intends* rather than the code he/she has written: while this appears URL-safe, it comes at the cost of a loss of information. The right way to encode data for a URL is to use urlencode(). – symcbean Jun 11 '10 at 11:25
  • (I see it does the translation shown in the example - but not what atif089 asked for) – symcbean Jun 11 '10 at 11:27
  • @symcbean: most people on board of this planetary ship are not comfortable `urldecode`-ing on the fly. – SilentGhost Jun 11 '10 at 11:31
  • @symcbean urlencode is not what I needed because I want to eliminate symbols rather than convert them. So this is exactly what I wanted. – Atif Jun 11 '10 at 11:37
  • shorter: strtolower(trim(preg_replace("/[^\w]+/", "-", $z), "-")) – mario Jun 11 '10 at 11:58
  • @mario: 1. it doesn't do the same processing; 2. it's a maintenance nightmare. – SilentGhost Jun 11 '10 at 12:11
4

First strip unwanted characters

$new_string = preg_replace("/[^a-zA-Z0-9\s]/", "", $string);

Then change spaces for underscores

$url = preg_replace('/\s/', '-', $new_string);

Finally encode it ready for use

$new_url = urlencode($url);
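
As a quick check of the three steps above, here is a minimal sketch for PHP 7 or later. The function name `to_url_fragment` is hypothetical and not part of the answer. It suggests that once only letters, digits and hyphens remain, the final `urlencode()` call has nothing left to encode, and that the original case is preserved.

```php
<?php
// Hypothetical wrapper around the three steps above (not from the answer).
function to_url_fragment(string $string): string
{
    // Step 1: drop everything that is not a letter, digit or whitespace.
    $stripped = preg_replace("/[^a-zA-Z0-9\s]/", "", $string);
    // Step 2: turn every whitespace character into a hyphen.
    $dashed = preg_replace('/\s/', '-', $stripped);
    // Step 3: URL-encode; letters, digits and hyphens pass through unchanged.
    return urlencode($dashed);
}

echo to_url_fragment('This, is the URL!'), "\n"; // This-is-the-URL
```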
Victor Bocharsky
Rooneyl
  • An underscore is a different character: `_` is an underscore, `-` is a hyphen. Also, using `urlencode` on such a string doesn't change anything. You're also forgetting the hyphen in the first regex, and `\s` is not equivalent to the space character. – SilentGhost Jun 11 '10 at 11:22
1

This will do it in a Unix shell (I just tried it on macOS):

$ tr -cs A-Za-z '-' < infile.txt > outfile.txt

I got the idea from a blog post on More Shell, Less Egg
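
For readers staying in PHP, the following is a rough counterpart to that `tr` call, offered as a sketch rather than an exact port. It assumes the input is a single string instead of a file, and the function name `tr_like` is hypothetical. Like the shell command, it treats digits as non-letters, keeps the original case, and leaves a hyphen wherever the input starts or ends with a non-letter.

```php
<?php
// Hypothetical PHP counterpart to `tr -cs A-Za-z '-'`:
// every run of non-letters collapses into a single hyphen.
function tr_like(string $input): string
{
    return preg_replace('/[^A-Za-z]+/', '-', $input);
}

echo tr_like('This, is the URL!'), "\n"; // This-is-the-URL-
```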

1

Try this:

function clean($string) {
    $string = str_replace(' ', '-', $string); // Replaces all spaces with hyphens.
    $string = preg_replace('/[^A-Za-z0-9\-]/', '', $string); // Removes special chars.

    return preg_replace('/-+/', '-', $string); // Replaces multiple hyphens with a single one.
}

Usage:

echo clean('a|"bc!@£de^&$f g');

Will output: abcdef-g

source : https://stackoverflow.com/a/14114419/2439715
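
The function above keeps the original case and any hyphen left at either end. If a lowercase, trimmed slug is wanted, one possible variant is sketched below for PHP 7 or later. The name `clean_slug` is hypothetical; it applies the same three steps plus `strtolower()` and `trim()`. Note that, like the original, it deletes underscores instead of turning them into hyphens.

```php
<?php
// Hypothetical variant of clean(): same three steps, plus lowercasing and edge trimming.
function clean_slug(string $string): string
{
    $string = str_replace(' ', '-', strtolower($string)); // Lowercase, then spaces to hyphens.
    $string = preg_replace('/[^a-z0-9\-]/', '', $string); // Remove everything else.
    $string = preg_replace('/-+/', '-', $string);         // Collapse runs of hyphens.

    return trim($string, '-');                            // Drop leading/trailing hyphens.
}

echo clean_slug('This, is the URL!'), "\n";        // this-is-the-url
echo clean_slug('What the_underscore ?!?'), "\n"; // what-theunderscore
```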

Community
Abhishek Goel
0

All previous answers deal with URLs, but in case someone needs to sanitize a string for something like a login name and keep it as text, here you go:

function sanitizeText($str) {
    $withSpecCharacters = htmlspecialchars($str);
    $splitted_str = str_split($str);
    $result = '';
    foreach ($splitted_str as $letter){
        if (strpos($withSpecCharacters, $letter) !== false) {
            $result .= $letter;
        }
    }
    return $result;
}

echo sanitizeText('ОРРииыфвсси ajvnsakjvnHB "&nvsp;\n" <script>alert()</script>');
//ОРРииыфвсси ajvnsakjvnHB &nvsp;\n scriptalert()/script
//No injections possible, as much of the input as possible is kept
Denis Matafonov
0

You should use the slugify package and not reinvent the wheel ;)

https://github.com/cocur/slugify

DjimOnDev
0
    function isolate($data) {
        
        $data = trim($data);
        $data = stripslashes($data);
        $data = htmlspecialchars($data);
        
        return $data;
    }
Hello Hack
  • Please add more information with your code, maybe how to use or how you got to this answer. Thank you. – Mehrad Jul 16 '20 at 01:24
0

The OP is not explicitly describing all of the attributes of a slug, but this is what I am gathering from the intent.

My interpretation of a perfect, valid, condensed slug aligns with this post: https://wordpress.stackexchange.com/questions/149191/slug-formatting-acceptable-characters#:~:text=However%2C%20we%20can%20summarise%20the,or%20end%20with%20a%20hyphen.

I find none of the earlier posted answers to achieve this consistently (and I'm not even stretching the scope of the question to include multi-byte characters).

  1. convert all characters to lowercase
  2. replace all sequences of one or more non-alphanumeric characters with a single hyphen.
  3. trim the leading and trailing hyphens from the string. Done.

I recommend the following one-liner which doesn't bother declaring single-use variables:

return trim(preg_replace('/[^a-z0-9]+/', '-', strtolower($string)), '-');
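
To try the one-liner outside of an existing function, it can be wrapped as in the sketch below (PHP 7 or later). The function name `make_slug` is an assumption made for illustration. The sample inputs are the three used in the demonstration that follows.

```php
<?php
// Hypothetical wrapper around the one-liner above.
function make_slug(string $string): string
{
    // Lowercase, replace each run of non-alphanumerics with one hyphen, trim edge hyphens.
    return trim(preg_replace('/[^a-z0-9]+/', '-', strtolower($string)), '-');
}

echo make_slug('This, is - - the URL!'), "\n";   // this-is-the-url
echo make_slug('Mork & Mindy'), "\n";            // mork-mindy
echo make_slug('What the_underscore ?!?'), "\n"; // what-the-underscore
```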

I have also prepared a demonstration which highlights what I consider to be inaccuracies in the other answers. (Demo)

'This, is - - the URL!' input
'this-is-the-url'       expected

'this-is-----the-url'   SilentGhost
'this-is-the-url'       mario
'This-is---the-URL'     Rooneyl
'This-is-the-URL'       AbhishekGoel
'This, is - - the URL!' HelloHack
'This, is - - the URL!' DenisMatafonov
'This,-is-----the-URL!' AdeelRazaAzeemi
'this-is-the-url'       mickmackusa

---
'Mork & Mindy'      input
'mork-mindy'        expected

'mork--mindy'       SilentGhost
'mork-mindy'        mario
'Mork--Mindy'       Rooneyl
'Mork-Mindy'        AbhishekGoel
'Mork &amp; Mindy'  HelloHack
'Mork & Mindy'      DenisMatafonov
'Mork-&-Mindy'      AdeelRazaAzeemi
'mork-mindy'        mickmackusa

---
'What the_underscore ?!?'   input
'what-the-underscore'       expected

'what-theunderscore'        SilentGhost
'what-the_underscore'       mario
'What-theunderscore-'       Rooneyl
'What-theunderscore-'       AbhishekGoel
'What the_underscore ?!?'   HelloHack
'What the_underscore ?!?'   DenisMatafonov
'What-the_underscore-?!?'   AdeelRazaAzeemi
'what-the-underscore'       mickmackusa
mickmackusa
-1

The following will replace spaces with dashes.

$str = str_replace(' ', '-', $str);

Then the following statement will remove everything except alphanumeric characters and dashes. (There are no spaces left to handle, because the previous step replaced them with dashes.)

// Char representation     0 -  9   A-   Z   a-   z  -    
$str = preg_replace('/[^\x30-\x39\x41-\x5A\x61-\x7A\x2D]/', '', $str);

Which is equivalent to

$str = preg_replace('/[^0-9A-Za-z-]+/', '', $str);
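
A small check of that equivalence, written as a sketch with an arbitrary sample string chosen for illustration: both patterns should strip the same characters, so the two results are expected to be identical.

```php
<?php
// Arbitrary sample input containing punctuation, a space, and an underscore.
$input = 'This,-is-the-URL! ~_[]';

// Hex-escaped character class from the answer.
$hex = preg_replace('/[^\x30-\x39\x41-\x5A\x61-\x7A\x2D]/', '', $input);
// Literal character class from the answer.
$literal = preg_replace('/[^0-9A-Za-z-]+/', '', $input);

var_dump($hex === $literal); // bool(true)
echo $hex, "\n";             // This-is-the-URL
```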

FYI: To remove everything outside the printable ASCII range (control characters and non-ASCII bytes) from a string, use

$str = preg_replace('/[^\x20-\x7E]/', '', $str); 

\x20 is the hexadecimal code for space, the first printable ASCII character, and \x7E is the tilde, the last one, according to Wikipedia: https://en.wikipedia.org/wiki/ASCII#Printable_characters

FYI: look at the Hex column for the range 20-7E.

Printable characters Codes 20hex to 7Ehex, known as the printable characters, represent letters, digits, punctuation marks, and a few miscellaneous symbols. There are 95 printable characters in total.
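
To make the effect of the `\x20-\x7E` range concrete, here is a minimal sketch with sample strings chosen for illustration (PHP 7 or later for the `\u{...}` escape). Because the pattern has no `u` modifier, it works byte by byte, so both bytes of a UTF-8 encoded `é` are removed, while printable punctuation such as `,` and `!` is kept.

```php
<?php
// Sample input: an accented letter (two UTF-8 bytes), a tab, and a bell control character.
$input = "Caf\u{00E9}\tmenu \x07ok~";
$printable = preg_replace('/[^\x20-\x7E]/', '', $input);
echo $printable, "\n"; // Cafmenu ok~

// Printable punctuation is inside the range, so this string comes back unchanged.
$unchanged = preg_replace('/[^\x20-\x7E]/', '', 'This, is the URL!');
echo $unchanged, "\n"; // This, is the URL!
```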