106

I am looking for the best way to "slugify" a string (that is, to turn it into a URL-friendly "slug"). My current solution is based on this recipe.

I have changed it a little bit to:

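import re
import unicodedata

s = 'String to slugify'

slug = unicodedata.normalize('NFKD', s)
# encode() returns bytes on Python 3, so decode back to str before using re
slug = slug.encode('ascii', 'ignore').decode('ascii').lower()
slug = re.sub(r'[^a-z0-9]+', '-', slug).strip('-')
slug = re.sub(r'[-]+', '-', slug)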

Does anyone see any problems with this code? It works fine, but maybe I am missing something, or you know a better way?

Community
Zygimantas
  • 1
    Are you working with Unicode a lot? If so, the last re.sub might be better if you wrap unicode() around it; this is what Django does. Also, the [^a-z0-9]+ can be shortened to use \w. See django.template.defaultfilters; it's close to yours, but a bit more refined. – Mike Ramirez Apr 07 '11 at 00:23
  • Are Unicode characters allowed in a URL? Also, I have changed \w to a-z0-9 because \w includes the _ character and uppercase letters. Letters are set to lowercase in advance, so there will be no uppercase letters to match. – Zygimantas Apr 07 '11 at 01:21
  • '_' is valid (but it's your choice; you did ask). Unicode is allowed as percent-encoded characters. – Mike Ramirez Apr 07 '11 at 01:36
  • Thank you Mike. Well, I asked the wrong question. Is there any reason to convert it back to a unicode string, if we have already replaced all characters except "a-z", "0-9" and "-"? – Zygimantas Apr 07 '11 at 01:47
  • For Django, I believe it's important to them to have all strings as unicode objects for compatibility. It's your choice if you want this. – Mike Ramirez Apr 07 '11 at 01:51
  • I did a pull request to the slugify (https://github.com/zacharyvoase/slugify) Python lib which addresses all the issues: https://github.com/ksamuel/slugify. It is a standalone, pure-Python, pip-installable slugify using unicodedata, or unidecode if installed. You can choose a custom separator and even keep all non-ASCII characters. I hope it will be accepted soon and pushed to PyPI. – e-satis Jul 22 '12 at 18:31
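
The points raised in the comments above (what `\w` matches, and how non-ASCII characters travel in a URL) can be checked with a short sketch. It assumes Python 3, where `\w` in a `str` pattern also matches Unicode letters; the sample string is made up for illustration.

```python
import re
from urllib.parse import quote

sample = "foo_bar déjà vu"

# [^a-z0-9] treats "_" and accented letters as separators.
strict = re.sub(r"[^a-z0-9]+", "-", sample).strip("-")

# \W leaves "_" alone and, for str patterns in Python 3, Unicode letters too.
loose = re.sub(r"\W+", "-", sample).strip("-")

# With re.ASCII, \w is limited to [a-zA-Z0-9_].
ascii_only = re.sub(r"\W+", "-", sample, flags=re.ASCII).strip("-")

# Non-ASCII characters in a URL path are sent as percent-encoded UTF-8 bytes.
encoded = quote("déjà")

print(strict)      # foo-bar-d-j-vu
print(loose)       # foo_bar-déjà-vu
print(ascii_only)  # foo_bar-d-j-vu
print(encoded)     # d%C3%A9j%C3%A0
```

So the choice between `\w` and `a-z0-9` mostly comes down to whether underscores and non-ASCII letters should survive in the slug.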

10 Answers

161

There is a Python package named python-slugify, which does a pretty good job of slugifying:

pip install python-slugify

Works like this:

from slugify import slugify

# The values in the comments are the ones reported in this answer;
# exact output may differ between package versions.
txt = "This is a test ---"
r = slugify(txt)
print(r)  # this-is-a-test

txt = "This -- is a ## test ---"
r = slugify(txt)
print(r)  # this-is-a-test

txt = 'C\'est déjà l\'été.'
r = slugify(txt)
print(r)  # cest-deja-lete

txt = 'Nín hǎo. Wǒ shì zhōng guó rén'
r = slugify(txt)
print(r)  # nin-hao-wo-shi-zhong-guo-ren

txt = 'Компьютер'
r = slugify(txt)
print(r)  # kompiuter

txt = 'jaja---lol-méméméoo--a'
r = slugify(txt)
print(r)  # jaja-lol-mememeoo-a

See More examples

This package does a bit more than what you posted (take a look at the source, it's just one file). The project is still active: it had been updated 2 days before I originally answered, and over seven years later (last checked 2020-06-30) it still gets updates.

Careful: there is a second package around, named slugify. If you have both installed, you might run into problems, as they share the same import name. The one just named slugify didn't pass all my quick checks: "Ich heiße" became "ich-heie" (should be "ich-heisse"). So be sure to pick the right one when using pip or easy_install.

kratenko
  • 6
    `python-slugify` is licensed under MIT, but it uses `Unidecode` which is licensed under GPL, so it might not fit for some projects. – Rotareti Aug 06 '17 at 21:40
  • @Rotareti Could you please explain why it might not fit all projects? Can't we use anything under the MIT or GPL license and include it inside commercial software? I think the only restriction is putting the license alongside the code we develop. Am I wrong? – Ghassem Tofighi Jul 14 '19 at 22:18
  • 1
    @GhassemTofighi In short: You can use it in your commercial software, but if you use it, you must open source your code as well. Anyway IANAL and this is no legal advice. – Rotareti Jul 15 '19 at 08:04
  • @GhassemTofighi maybe take a look at https://softwareengineering.stackexchange.com/q/47032/71504 on that topic – kratenko Jul 17 '19 at 09:11
  • 1
    @Rotareti `python-slugify` now defaults to the Artistic License'd `text-unidecode` instead of the GPL-licensed `Unidecode`, addressing your licensing concern. https://github.com/un33k/python-slugify/commit/b8be7d69119dcceb9a3e0ce64a509415737190ac#diff-e4156a8bee1b298082516842836621b9 – Emilien Jul 27 '19 at 23:49
32

Install unidecode from here for Unicode support:

pip install unidecode

import re
import unidecode

def slugify(text):
    # Transliterate to ASCII, lowercase, then turn runs of non-alphanumerics into "-"
    text = unidecode.unidecode(text).lower()
    return re.sub(r'[\W_]+', '-', text)

text = u"My custom хелло ворлд"
print(slugify(text))  # my-custom-khello-vorld

Arne
Normunds
  • 1
    Hi, it's a bit strange, but for me it gives a result like this: "my-custom-ndud-d-d3-4-d2d3-4nd-d-" – derevo Jul 30 '12 at 07:04
  • 1
    @derevo that happens when you don't pass unicode strings. Replace `slugify("My custom хелло ворлд")` with `slugify(u"My custom хелло ворлд")`, and it should work. – kratenko Dec 16 '12 at 12:10
  • 10
    I would suggest against using variable names like `str`. This hides the builtin `str` type. – crodjer Apr 19 '14 at 07:22
  • 2
    unidecode is GPL, which may not be suitable for some. – Jorge Leitao Apr 25 '15 at 06:59
  • What about reslugifying or deslugifying? – Ryan Chou Jan 24 '19 at 03:46
  • @RyanChou that can't be done unambiguously. The best you can do is something like replacing "-" with a space and making the first letter uppercase, but you can't tell whether "my-custom-khello-vorld" was "My custom хелло ворлд", or "MY ČUSTOM KHELLO-VORLD", or anything else that slugifies into that particular slug – M. Volf Dec 23 '19 at 16:52
  • @M.Volf Got it. – Ryan Chou Dec 24 '19 at 16:36
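
The ambiguity described in the comments above can be illustrated with a small sketch. `simple_slugify` is a hypothetical ASCII-only helper written for this illustration, not the function from the answer, and the input strings are made up.

```python
import re

def simple_slugify(text):
    """Lowercase the text and turn every run of other characters into one hyphen."""
    return re.sub(r"[^a-z0-9]+", "-", text.lower()).strip("-")

inputs = [
    "My custom khello vorld",
    "MY CUSTOM KHELLO-VORLD",
    "my_custom...khello/vorld",
]

# Three different inputs collapse into a single slug, so the mapping is not reversible.
slugs = {simple_slugify(t) for t in inputs}
print(slugs)  # {'my-custom-khello-vorld'}
```

If the original title is needed later, one option is to store it next to the slug instead of trying to reconstruct it.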
11

There is a Python package named awesome-slugify:

pip install awesome-slugify

Works like this:

from slugify import slugify

slugify('one kožušček')  # one-kozuscek

awesome-slugify github page

voronin
7

It works well in Django, so I don't see why it wouldn't be a good general purpose slugify function.

Are you having any problems with it?

Nick Presta
6
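import re
import unicodedata

# Django 1.x imports; allow_lazy and django.utils.six were removed in later Django releases.
from django.utils import six
from django.utils.functional import allow_lazy
from django.utils.safestring import mark_safe

def slugify(value):
    """
    Converts to lowercase, removes non-word characters (alphanumerics and
    underscores) and converts spaces to hyphens. Also strips leading and
    trailing whitespace.
    """
    value = unicodedata.normalize('NFKD', value).encode('ascii', 'ignore').decode('ascii')
    value = re.sub(r'[^\w\s-]', '', value).strip().lower()
    return mark_safe(re.sub(r'[-\s]+', '-', value))
slugify = allow_lazy(slugify, six.text_type)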

This is the slugify function from django.utils.text. It should be sufficient for your requirements.
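
For use outside Django, the same three steps can be sketched with the standard library only. This is an adaptation for illustration, not Django's code: it assumes Python 3 and leaves out the `mark_safe` and `allow_lazy` wrappers.

```python
import re
import unicodedata

def slugify(value):
    """Standalone variant of the steps above, without the Django helpers."""
    # Decompose accented characters, then drop anything that is not ASCII.
    value = unicodedata.normalize("NFKD", value).encode("ascii", "ignore").decode("ascii")
    # Remove everything except word characters, whitespace and hyphens.
    value = re.sub(r"[^\w\s-]", "", value).strip().lower()
    # Collapse runs of whitespace and hyphens into a single hyphen.
    return re.sub(r"[-\s]+", "-", value)

print(slugify("Hello, World!"))        # hello-world
print(slugify("C'est déjà l'été."))    # cest-deja-lete
print(slugify("snake_case  -- test"))  # snake_case-test
```

Note that this variant keeps underscores and does not strip leading or trailing hyphens, unlike the code in the question.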

Animesh Sharma
6

The problem is with the ascii normalization line:

slug = unicodedata.normalize('NFKD', s)

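This step is Unicode normalization, and it does not decompose many characters into ASCII equivalents. Characters that have no decomposition are then simply dropped by the ASCII encoding that follows. For example, the non-ASCII characters would be stripped from the following strings: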

Mørdag -> mrdag
Æther -> ther

A better way to do it is to use the unidecode module that tries to transliterate strings to ascii. So if you replace the above line with:

import unidecode
slug = unidecode.unidecode(s)

You get better results for the above strings and for many Greek and Russian characters too:

Mørdag -> mordag
Æther -> aether
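
The limitation of NFKD can be reproduced with the standard library alone. The sketch below assumes Python 3; the `FALLBACK` table is a hypothetical, deliberately tiny stand-in for what a transliteration library does on a much larger scale.

```python
import unicodedata

def nfkd_ascii(text):
    """Decompose, then drop whatever cannot be encoded as ASCII."""
    decomposed = unicodedata.normalize("NFKD", text)
    return decomposed.encode("ascii", "ignore").decode("ascii")

# Hypothetical mini table for letters that have no Unicode decomposition.
FALLBACK = str.maketrans({"ø": "o", "Ø": "O", "æ": "ae", "Æ": "AE", "ß": "ss"})

def nfkd_ascii_with_fallback(text):
    """Apply the manual replacements first, then the NFKD/ASCII step."""
    return nfkd_ascii(text.translate(FALLBACK))

for word in ["Mørdag", "Æther", "déjà"]:
    print(word, "->", nfkd_ascii(word).lower(), "/", nfkd_ascii_with_fallback(word).lower())
# Mørdag -> mrdag / mordag
# Æther -> ther / aether
# déjà -> deja / deja
```

Accented letters such as "é" decompose into a base letter plus a combining mark, which is why they survive, while "ø" and "Æ" are single code points with no decomposition.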
Björn Lindqvist
3

Unidecode is good; however, be careful: unidecode is GPL. If this license doesn't fit then use this one

BomberMan
Mikhail Korobov
2

A couple of options on GitHub:

  1. https://github.com/dimka665/awesome-slugify
  2. https://github.com/un33k/python-slugify
  3. https://github.com/mozilla/unicode-slugify

Each supports slightly different parameters for its API, so you'll need to look through to figure out what you prefer.

In particular, pay attention to the different options they provide for dealing with non-ASCII characters. Pydanny wrote a very helpful blog post illustrating some of the unicode handling differences in these slugify'ing libraries: http://www.pydanny.com/awesome-slugify-human-readable-url-slugs-from-any-string.html This blog post is slightly outdated because Mozilla's unicode-slugify is no longer Django-specific.

Also note that currently awesome-slugify is GPLv3, though there's an open issue where the author says they'd prefer to release as MIT/BSD, just not sure of the legality: https://github.com/dimka665/awesome-slugify/issues/24

Jeff Widman
1

You might consider changing the last line to

slug = re.sub(r'--+', '-', slug)

since the pattern [-]+ is no different than -+, and you don't really care about matching just one hyphen, only two or more.

But, of course, this is quite minor.
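
A quick check of the claim, as a sketch with made-up sample strings: the three patterns give the same result, and the earlier `[^a-z0-9]+` substitution from the question already collapses runs of hyphens, since a hyphen is itself outside `a-z0-9`.

```python
import re

samples = ["a-b", "a--b", "a---b--c", "-a-", "no hyphens"]

for s in samples:
    collapsed_class = re.sub(r"[-]+", "-", s)
    collapsed_plain = re.sub(r"-+", "-", s)
    collapsed_two_plus = re.sub(r"--+", "-", s)
    # All three patterns produce the same string.
    assert collapsed_class == collapsed_plain == collapsed_two_plus

# The first substitution in the question's code leaves no repeated hyphens behind.
first_pass = re.sub(r"[^a-z0-9]+", "-", "a -- b---c").strip("-")
print(first_pass)  # a-b-c
```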

unutbu
0

Another option is boltons.strutils.slugify. Boltons has quite a few other useful functions as well, and is distributed under a BSD license.

ostrokach