9

The method should allow only the characters "0123456789abcdefghijklmnopqrstuvwxyzABCDEFGHIJKLMNOPQRSTUVWXYZ-" in URI strings.

What is the best way to produce a nice, SEO-friendly URI string?

Community
MatBanik
  • 3
    This sounds like a terrible idea. Consider [encoding the URL](http://download.oracle.com/javase/1.5.0/docs/api/java/net/URLEncoder.html) instead. – moinudin Jan 02 '11 at 23:19
  • @marcog: It sounds a lot like what SO does to generate SEO-friendly URLs from titles. Mind you, I'd be very tempted to just replace all non-alnum char sequences with a single hyphen; same general effect (if perhaps slightly different in edge cases) but easier to understand. – Donal Fellows Jan 02 '11 at 23:27
  • @Donal Oh, right. Surely you'd generate a random string from the set of allowed characters though? – moinudin Jan 02 '11 at 23:28
  • 1
    @marcog: What SO does is put that part (which actually *doesn't matter*) on the end of the URL; the path fragment before is an ID which is what actually locates the question. It's safe to use user input for this because the sanitization is defined in terms of a severe whitelist of characters. (Random string? Where did that come from?) – Donal Fellows Jan 03 '11 at 20:01
  • @Donal Okay, I see what you're referring to. I thought you meant the ID, e.g. 4581025 for this question. Thanks for clarifying! – moinudin Jan 03 '11 at 20:09

3 Answers

35

The general consensus is:

  1. Lowercase the string.

    string = string.toLowerCase();
    
  2. Normalize all characters and get rid of all diacritical marks (so that e.g. é, ö, à become e, o, a).

    string = Normalizer.normalize(string, Form.NFD).replaceAll("\\p{InCombiningDiacriticalMarks}+", "");
    
  3. Replace each run of remaining non-alphanumeric characters with a single -.

    string = string.replaceAll("[^\\p{Alnum}]+", "-");
    

So, summarized:

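import java.text.Normalizer;
import java.text.Normalizer.Form;

public static String toPrettyURL(String string) {
    // Lowercase, decompose accented characters (NFD), drop the combining marks,
    // then replace each run of non-alphanumerics with a single hyphen.
    return Normalizer.normalize(string.toLowerCase(), Form.NFD)
        .replaceAll("\\p{InCombiningDiacriticalMarks}+", "")
        .replaceAll("[^\\p{Alnum}]+", "-");
}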
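
To see what the three steps produce, here is a minimal, self-contained sketch. The class name `SlugDemo` and the sample titles are made up for illustration, and `Locale.ROOT` is an addition that is not in the method above: without it, `toLowerCase()` follows the default locale, and under a Turkish locale "I" lowercases to a dotless "ı", which step 3 would then turn into a hyphen.

```java
import java.text.Normalizer;
import java.text.Normalizer.Form;
import java.util.Locale;

final class SlugDemo {
    // Same three steps as above; Locale.ROOT is an assumption added here
    // so that lowercasing does not depend on the machine's default locale.
    static String toPrettyURL(String string) {
        return Normalizer.normalize(string.toLowerCase(Locale.ROOT), Form.NFD)
            .replaceAll("\\p{InCombiningDiacriticalMarks}+", "")
            .replaceAll("[^\\p{Alnum}]+", "-");
    }

    public static void main(String[] args) {
        // An accented title, written with Unicode escapes so the source
        // file compiles regardless of its encoding.
        System.out.println(toPrettyURL("Cr\u00e8me Br\u00fbl\u00e9e \u00e0 la carte"));
        // prints creme-brulee-a-la-carte

        // A title ending in punctuation keeps a trailing hyphen.
        System.out.println(toPrettyURL("What is Java?"));
        // prints what-is-java-
    }
}
```

Note that `\p{Alnum}` matches only ASCII letters and digits by default in Java, so letters that NFD does not decompose into a base letter plus a combining mark (ß or ø, for example) end up as hyphens rather than being transliterated.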
BalusC
  • 4
    Also recommend removing leading and trailing punctuation: `.replaceAll("[^a-z0-9]+$", "")` removes trailing punctuation and `.replaceAll("^[^a-z0-9]+", "")` removes leading punctuation. – Jason Thrasher May 06 '12 at 18:24
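
A minimal sketch of that trimming step, assuming it runs on a slug that has already been lowercased and hyphenated, so the only non-alphanumeric character left at either end is a hyphen. The helper name `trimHyphens` is hypothetical, and the single pattern is a condensed variant of the two patterns in the comment, not the commenter's exact code.

```java
final class SlugTrim {
    // Hypothetical helper: strips hyphens left at either end of a slug.
    // Interior hyphens are untouched.
    static String trimHyphens(String slug) {
        return slug.replaceAll("^-+|-+$", "");
    }

    public static void main(String[] args) {
        System.out.println(trimHyphens("-what-is-java-")); // prints what-is-java
        // A title made only of punctuation collapses to "-" and then to
        // an empty string, so callers may want a fallback slug.
        System.out.println(trimHyphens("-").isEmpty());    // prints true
    }
}
```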
4

The following chain of regex replacements does the same thing as your algorithm. I'm not aware of any library for this sort of thing.

String s = input
    .replaceAll(" ?- ?", "-")          // remove spaces around hyphens
    .replaceAll("[ ']", "-")           // turn spaces and quotes into hyphens
    .replaceAll("[^0-9a-zA-Z-]", "");  // remove everything not in our allowed char set
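
For comparison, here is the same chain wrapped in a runnable sketch. The class and method names are made up, and `collapseHyphens` is an optional extra step that is not part of this answer. The sample inputs suggest two things to keep in mind: adjacent spaces and quotes each become their own hyphen, and accented letters are dropped rather than converted to their base letter.

```java
final class HyphenChainDemo {
    // The three replacements from the answer, wrapped in a method.
    static String slugify(String input) {
        return input
            .replaceAll(" ?- ?", "-")          // remove spaces around hyphens
            .replaceAll("[ ']", "-")           // spaces and quotes become hyphens
            .replaceAll("[^0-9a-zA-Z-]", "");  // drop everything else
    }

    // Optional extra step (an assumption, not in the answer):
    // collapse runs of hyphens into one.
    static String collapseHyphens(String s) {
        return s.replaceAll("-{2,}", "-");
    }

    public static void main(String[] args) {
        String s = slugify("Rock 'n' Roll - Live");
        System.out.println(s);                  // prints Rock--n--Roll-Live
        System.out.println(collapseHyphens(s)); // prints Rock-n-Roll-Live
        // The accented letter (written as a Unicode escape) is removed.
        System.out.println(slugify("Caf\u00e9 au lait")); // prints Caf-au-lait
    }
}
```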
killdash9
1

These are commonly called "slugs" if you want to search for more information.

You may want to check out other answers, such as "How can I create a SEO friendly dash-delimited url from a string?" and "How to make Django slugify work properly with Unicode strings?"

They cover C# and Python more than JavaScript, but include some language-agnostic discussion of slug conventions and of issues you may face when generating slugs (such as uniqueness and Unicode normalization problems).
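
One way around the uniqueness problem, following the comment under the question about how SO builds its URLs, is to put a numeric ID in the path and treat the slug as decoration. The sketch below assumes a hypothetical `/questions/{id}/{slug}` layout; the class name, method names, and the sample slug are invented for illustration.

```java
final class SlugPath {
    // Hypothetical URL layout: /questions/{id}/{slug}. The id locates the
    // record, so the slug never has to be unique.
    static String questionPath(long id, String slug) {
        return "/questions/" + id + "/" + slug;
    }

    // Extracts the id and ignores whatever slug follows it.
    // Assumes the path starts with "/questions/" followed by a number.
    static long idFromPath(String path) {
        String[] parts = path.split("/"); // ["", "questions", id, slug]
        return Long.parseLong(parts[2]);
    }

    public static void main(String[] args) {
        String path = questionPath(4581025L, "making-a-nice-seo-uri-string");
        System.out.println(path);
        // prints /questions/4581025/making-a-nice-seo-uri-string
        System.out.println(idFromPath(path)); // prints 4581025
    }
}
```

With this layout two titles that produce the same slug still get distinct URLs, and a changed or mistyped slug does not break the lookup.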

Community
Josh Segall