13

Say I have this webpage:
http://ww.xyz.com/Product.aspx?CategoryId=1

If the name of CategoryId=1 is "Dogs" I would like to convert the URL into something like this:
http://ww.xyz.com/Products/Dogs

The problem is if the category name contains foreign (or invalid for a url) characters. If the name of CategoryId=2 is "Göra äldre", what should be the new url?

Logically it should be:
http://ww.xyz.com/Products/Göra äldre
but that will not work. The space is easy to handle (I can replace it with a dash, for example), but what about the foreign characters? In ASP.NET I could use the UrlEncode function, which would give something like this:
http://ww.xyz.com/Products/G%c3%b6ra+%c3%a4ldre
but I can't really say that's better than the original URL (http://ww.xyz.com/Product.aspx?CategoryId=2).

Ideally I would like to generate the following URL, but how can I do this automatically (i.e. convert foreign characters to 'safe' URL characters)?
http://ww.xyz.com/Products/Gora-aldre

Anthony

4 Answers

34

I've come up with the following two extension methods (ASP.NET / C#):

    public static string RemoveAccent(this string txt)
    {
        // Round-trip through the "Cyrillic" code page so accented letters fall back to plain ASCII letters
        byte[] bytes = System.Text.Encoding.GetEncoding("Cyrillic").GetBytes(txt);
        return System.Text.Encoding.ASCII.GetString(bytes);
    }

    public static string Slugify(this string phrase)
    {
        string str = phrase.RemoveAccent().ToLower();
        str = System.Text.RegularExpressions.Regex.Replace(str, @"[^a-z0-9\s-]", ""); // Remove all invalid characters
        str = System.Text.RegularExpressions.Regex.Replace(str, @"\s+", " ").Trim(); // Collapse runs of whitespace into one space
        str = System.Text.RegularExpressions.Regex.Replace(str, @"\s", "-"); // Replace spaces with dashes
        return str;
    }
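For anyone who would rather not depend on a specific code page, here is a minimal sketch of the same idea using Unicode normalization instead. The class and method names (SlugSketch, RemoveDiacritics) are made up for illustration and are not part of the answer above. It also collapses repeated hyphens. Letters that have no canonical decomposition (Ø or ß, for example) are not transliterated by this approach; the regex simply drops them.

```csharp
using System.Globalization;
using System.Text;
using System.Text.RegularExpressions;

public static class SlugSketch
{
    // Hypothetical helper: decompose each character (NFD), then drop the combining marks.
    public static string RemoveDiacritics(string text)
    {
        string decomposed = text.Normalize(NormalizationForm.FormD);
        var sb = new StringBuilder(decomposed.Length);
        foreach (char c in decomposed)
        {
            if (CharUnicodeInfo.GetUnicodeCategory(c) != UnicodeCategory.NonSpacingMark)
            {
                sb.Append(c);
            }
        }
        return sb.ToString().Normalize(NormalizationForm.FormC);
    }

    public static string Slugify(string phrase)
    {
        string str = RemoveDiacritics(phrase).ToLowerInvariant();
        str = Regex.Replace(str, @"[^a-z0-9\s-]", ""); // Remove everything except letters, digits, whitespace and hyphens
        str = Regex.Replace(str, @"[\s-]+", "-");      // Collapse runs of whitespace and hyphens into one hyphen
        return str.Trim('-');                          // Drop leading and trailing hyphens
    }
}
```

With this sketch, SlugSketch.Slugify("Göra äldre") should give "gora-aldre".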
Anthony
  • 3
    I wrote out a huge method full of if statements using the char class until I found this. Good stuff. – The Muffin Man Sep 28 '12 at 03:46
  • I think ISAPI does the same, but I want more control over my URLs. This is a good solution. – Erik Bergstedt Mar 15 '13 at 07:49
  • 1
    Thanks for the function. I had to add another piece at the end to replace two or more hyphens with a single hyphen. str = System.Text.RegularExpressions.Regex.Replace(str, @"\-+", "-"); // convert multiple hyphens into one hyphen – Richard Edwards Jul 09 '14 at 15:50
  • I used this beautiful solution but I kept accents and replaced [^a-z0-9\s-] with this [^\w\d\s-] to support Unicode URL's. Up vote this or hug me. – Shadi Namrouti Aug 22 '17 at 09:25
2

Transliterate non-ASCII characters to ASCII, using something like this:

var str = "éåäöíØ";
var noApostrophes = Encoding.ASCII.GetString(Encoding.GetEncoding("Cyrillic").GetBytes(str)); 

=> "eaaoiO"

(Source)
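The code-page trick depends on whatever fallback mappings that one encoding happens to provide. Where more explicit control is wanted, a small lookup table is one alternative, particularly for letters such as Ø, æ or ß that have no Unicode decomposition into a base letter plus an accent. The sketch below is only an illustration: the class name, method name and the (deliberately incomplete) table are assumptions, not something from the linked source.

```csharp
using System.Collections.Generic;
using System.Text;

public static class TransliterationSketch
{
    // Hypothetical, incomplete table; extend it with whatever letters your data contains.
    private static readonly Dictionary<char, string> Replacements = new Dictionary<char, string>
    {
        { 'Ø', "O" }, { 'ø', "o" },
        { 'Æ', "AE" }, { 'æ', "ae" },
        { 'Đ', "D" }, { 'đ', "d" },
        { 'Ł', "L" }, { 'ł', "l" },
        { 'ß', "ss" }
    };

    // Replace each character found in the table; pass everything else through unchanged.
    public static string ReplaceSpecialLetters(string text)
    {
        var sb = new StringBuilder(text.Length);
        foreach (char c in text)
        {
            if (Replacements.TryGetValue(c, out string replacement))
            {
                sb.Append(replacement);
            }
            else
            {
                sb.Append(c);
            }
        }
        return sb.ToString();
    }
}
```

A table like this would normally run before any accent stripping, so that the remaining accented letters can still be handled by whichever method you use for those.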

Sjoerd
  • 1
    What if some characters are not Cyrillic? I need a solution which will always work. – Anthony Jul 18 '10 at 11:11
  • Then you'll need to add more checks for different types of encoding. Unfortunately there's no magic wand here unless you use a library that does it all for you. – hollsk Jul 18 '10 at 11:19
  • 1
    Maybe the UnidecodeSharp library is what you are looking for: http://unidecode.codeplex.com/ – Sjoerd Jul 18 '10 at 11:22
1

One other thing worth considering:

If a user provides a string such as 好听的音乐 and you want to convert it to a URL-friendly title, consider using IdnMapping.

For example:

string urlFriendlyTitle = Slugify(url);

public static string Slugify(string text)
{
    IdnMapping idnMapping = new IdnMapping();
    text = idnMapping.GetAscii(text);

    text = RemoveAccent(text).ToLower();

    //  Remove all invalid characters.  
    text = Regex.Replace(text, @"[^a-z0-9\s-]", "");

    //  Convert multiple spaces into one space
    text = Regex.Replace(text, @"\s+", " ").Trim();

    //  Replace spaces by underscores.
    text = Regex.Replace(text, @"\s", "_");

    return text;
}

public static string RemoveAccent(string text)
{
    byte[] bytes = Encoding.GetEncoding("Cyrillic").GetBytes(text);

    return Encoding.ASCII.GetString(bytes);
}

Without this step, 好听的音乐 is converted to string.Empty. With it, the result is xn--fjqr6lw2ek78az68a, which is Punycode.
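To see the Punycode step in isolation, here is a small sketch that only wraps IdnMapping.GetAscii and IdnMapping.GetUnicode. The wrapper class and method names are made up for illustration. Keep in mind that IdnMapping is designed for domain-name labels, so as far as I know it rejects overly long labels and some characters with an ArgumentException; real input would need a try/catch or pre-validation.

```csharp
using System.Globalization;

public static class PunycodeSketch
{
    // Encode a Unicode label as ASCII (Punycode with the "xn--" prefix).
    public static string ToAsciiLabel(string text)
    {
        return new IdnMapping().GetAscii(text);
    }

    // Decode a Punycode label back to its Unicode form.
    public static string FromAsciiLabel(string ascii)
    {
        return new IdnMapping().GetUnicode(ascii);
    }
}
```

Usage would look like PunycodeSketch.ToAsciiLabel("好听的音乐"), and passing the result to FromAsciiLabel should return the original text. One thing to check in the Slugify above: because GetAscii runs before RemoveAccent, accented Latin input such as "Göra" would presumably be Punycode-encoded as well rather than transliterated, so you may want to call it only when accent removal leaves nothing usable.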

Sean Anderson
-1

I use the function described at http://www.blackbeltcoder.com/Articles/strings/converting-text-to-a-url-friendly-slug. It doesn't directly support non-English characters, but could be easily updated to support additional characters.

I like it because it produces a very clean-looking slug.

Jonathan Wood
  • In your TextToSlug function what if the string to convert contains an accent? For example 'fiancé' which is a perfect English word. There are plenty of similar examples in English. IsLetterOrDigit will return true for the é character so you would end up with it in your url which would be incorrect as ideally é should be converted to e in the url. – Anthony Dec 18 '10 at 08:25
  • What does "ideally" mean here? Are you saying fiancé is invalid within a URL? This hasn't come up while I've been using my code, but I'm more than happy to modify it if this causes problems. – Jonathan Wood Dec 18 '10 at 08:37