
I want my string to only have alphanumeric characters, -, and underscores. That's it. I am trying to write a method that takes a user input string and converts it so that it follows this guideline.

My regex is obviously [a-zA-Z0-9_-]. What I want to do is replace every space with a -, and just remove all the other characters that don't match my regex.

So, the string 'Hello, world!' would get converted into 'Hello-world'. The special characters get removed, and the space is replaced with a -.

What would be the most efficient way to do this in Python? Do I have to iterate over the entire string character by character, or is there a better way? Thanks!

darkhorse

2 Answers


You can do it with two subs: 1) replace spaces with -; 2) remove other unwanted characters:

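import re

s = 'Hello, world!'

# Inner sub: turn each run of whitespace into a single '-'.
# Outer sub: drop everything that is not a letter, '_' or '-'.
re.sub(r"[^a-zA-Z_-]", "", re.sub(r"\s+", "-", s))
# 'Hello-world'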

If you want to keep digits in your string:

re.sub(r"[^a-zA-Z0-9_-]", "", re.sub(r"\s+", "-", s))
# 'Hello-world'

Here [^a-zA-Z_-] matches any single character that is not a letter (upper or lower case), an underscore, or a dash. The dash needs to be placed at the end of the character class [] so that it is treated as a literal dash rather than as part of a range.
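To make the point about dash placement concrete, here is a minimal sketch. The helper name sanitize and the two module-level pattern names are my own (hypothetical, not from any library); the patterns are the digit-keeping ones from above, precompiled so they can be reused across calls.

```python
import re

# Hypothetical names; the patterns are compiled once and reused.
_WHITESPACE = re.compile(r"\s+")
_DISALLOWED = re.compile(r"[^a-zA-Z0-9_-]")

def sanitize(text):
    """Replace whitespace runs with '-' and drop every other disallowed character."""
    return _DISALLOWED.sub("", _WHITESPACE.sub("-", text))

print(sanitize("Hello, world!"))           # Hello-world

# Dash at the end of the class: a literal '-', so it is kept.
print(re.sub(r"[^a-z_-]", "", "a-b_c`"))   # a-b_c

# Dash in the middle: '_-z' is a range from '_' to 'z', which covers
# '_', '`' and a-z but not '-', so the dash is removed and '`' is kept.
print(re.sub(r"[^_-z]", "", "a-b_c`"))     # ab_c`
```

If you would rather not depend on position, escaping the dash as \- inside the class also makes it literal.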

Psidom

What you want is also often needed when generating URL names for content. It is implemented in django.utils.text.slugify, although that function also converts to lowercase. Here is a simplified version of Django's slugify function that preserves case:

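import re

def slugify(value):
    # Drop everything except letters, digits, '_', '-' and whitespace,
    # then trim leading/trailing whitespace.
    value = re.sub(r'[^A-Za-z0-9_\s-]', '', value, flags=re.U).strip()
    # Collapse each run of hyphens and whitespace into a single '-'.
    return re.sub(r'[-\s]+', '-', value, flags=re.U)

print(slugify("Hello World!"))
# Hello-World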
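The order of the two substitutions matters on messier input. As a rough comparison, the sketch below puts the two-substitution approach from the other answer (wrapped in a function I have named two_subs, a hypothetical name) next to slugify as above, with 0-9 in the character class so that digits survive as the question asks.

```python
import re

def two_subs(s):
    # Whitespace -> '-' first, then drop disallowed characters.
    return re.sub(r"[^a-zA-Z0-9_-]", "", re.sub(r"\s+", "-", s))

def slugify(value):
    # Drop disallowed characters first, trim, then collapse separators.
    value = re.sub(r"[^A-Za-z0-9_\s-]", "", value).strip()
    return re.sub(r"[-\s]+", "-", value)

for text in ["Hello, world!", "  Hello,   world!  ", "Hello - world"]:
    print(repr(text), "->", two_subs(text), "|", slugify(text))

# 'Hello, world!' -> Hello-world | Hello-world
# '  Hello,   world!  ' -> -Hello-world- | Hello-world
# 'Hello - world' -> Hello---world | Hello-world
```

Both give the same result for the example in the question; they only differ when the input has leading or trailing whitespace or hyphens that are already surrounded by spaces.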

Tristan