3

I'm involved in a web project. I have to choose the best ways to represent the code, so that other people can read it without problems/headaches/whatever.

The "problem" I've tackled now is to show a nice formatted url (will be taken from a "title" string).

So, let's suppose we have a title, fetched from the form:

title = request.form['title'] # 'Hello World, Hello Cat! Hello?'

Then we need a function to format it for inclusion in the URL (it needs to become 'hello_world_hello_cat_hello'), so for the moment I'm using this one, which I think sucks for readability:

title.replace(' ', '_').replace('!', '').replace('?', '').replace(',', '').lower()

What would be a good way to compact it? Is there already a function for doing what I'm doing?

I'd also like to know which characters/symbols I should strip from the url.

6 Answers

5

You can use urlencode(), which is the way to URL-encode strings in Python.

If instead you want a personalized encoding like your expected output, and all you want to do is keep the words in the final string, you can use the re.findall function to grab them and then join them with an underscore:

>>> import re
>>> s = 'Hello World, Hello Cat! Hello?'
>>> '_'.join(re.findall(r'\w+', s)).lower()
'hello_world_hello_cat_hello'

What this does is:

g = re.findall(r'\w+',s) # ['Hello', 'World', 'Hello', 'Cat', 'Hello']
s1 = '_'.join(g) # 'Hello_World_Hello_Cat_Hello'
s1.lower() # 'hello_world_hello_cat_hello'

This technique also works well with numbers in the string:

>>> s = 'Hello World, Hello Cat! H123ello? 123'
>>> '_'.join(re.findall(r'\w+', s)).lower()
'hello_world_hello_cat_h123ello_123'

Another way, which I think should be faster, is to actually replace the non-alphanumeric chars. This can be accomplished with re.sub by grabbing runs of non-alphanumerics together and replacing them with _ like this:

>>> re.sub(r'\W+', '_', s).lower()
'hello_world_hello_cat_h123ello_123'

Well... not really, speed tests:

$python -mtimeit -s "import re" -s "s='Hello World, Hello Cat! Hello?'" "'_'.join(re.findall(r'\w+',s)).lower()"
100000 loops, best of 3: 5.08 usec per loop


$python -mtimeit -s "import re" -s "s='Hello World, Hello Cat! Hello?'" "re.sub(r'\W+','_',s).lower()"
100000 loops, best of 3: 6.55 usec per loop
Paulo Bu
  • wow, that's really smart! Much more concise than my answer. I'd be interested in speed, but honestly this DOES NOT appear to be a function that will limit performance, so worrying about that would definitely be premature optimization – Adam Smith Apr 17 '14 at 19:38
  • @AdamSmith `translate` is really fast and I really try to avoid regular expressions as much as I can but this task seemed to be really simple using them :) – Paulo Bu Apr 17 '14 at 19:41
3

You could use urlencode() from the urllib module in Python 2, or from the urllib.parse module in Python 3.

This will work assuming you're trying to use the text in the query string of your URL.

import urllib

title = {'title': 'Hello World, Hello Cat! Hello?'} # or get it programmatically as you did
encoded = urllib.urlencode(title)
print encoded # title=Hello+World%2C+Hello+Cat%21+Hello%3F
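
For reference, a minimal Python 3 equivalent (just a sketch; urlencode lives in urllib.parse there):

from urllib.parse import urlencode

title = {'title': 'Hello World, Hello Cat! Hello?'}
encoded = urlencode(title)
print(encoded)  # title=Hello+World%2C+Hello+Cat%21+Hello%3F
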
jshanley
  • As per OP: "so that other people can read it without problems/headaches/whatever." I've got a headache just looking at `%2C` :) Otherwise a stellar solution, and one I would use myself, so still +1 – Adam Smith Apr 17 '14 at 19:49
  • @AdamSmith I thought OP was referring to code readability, not the url, but I could be mistaken. – jshanley Apr 17 '14 at 19:56
  • @jshanley upon a second read, you might be right. Regardless, this is the BEST solution :) – Adam Smith Apr 17 '14 at 20:00
  • I was also confused by the OP's readability comment and the output example. This is, of course, _the_ way to URL-encode strings in Python :) – Paulo Bu Apr 17 '14 at 20:02
3

So I've been playing with all of your answers' solutions, and here's what I've come up with.

Note: these "benchmarks" are not to be taken too seriously, since I didn't try every possible setup, but they give a quick, broad view.

re.findall()

import re

def findall():
    string = 'Hello World, Hello Cat! Hello?'
    return '_'.join(re.findall(r'\w+', string)).lower()

real=0.019s, user=0.012s, sys=0.004s, rough=0.016s

re.sub()

def sub():
    string = 'Hello World, Hello Cat! Hello?'
    return re.sub(r'\W+', '_', string).lower()

real=0.020s, user=0.016s, sys=0.004s, rough=0.020s

slugify()

from slugify import slugify  # e.g. from the python-slugify package

def slug():
    string = 'Hello World, Hello Cat! Hello?'
    return slugify(string)

real=0.031s, user=0.024s, sys=0.004s, rough=0.028s

urllib.urlencode()

import urllib  # Python 2; use urllib.parse.urlencode in Python 3

def urlenc():
    string = {'title': 'Hello World, Hello Cat! Hello?'}
    return urllib.urlencode(string)

real=0.036s, user=0.024s, sys=0.008s, rough=0.032s

As you can see, the fastest is re.findall(), the slowest is urllib.urlencode(), and in the middle there's slugify(), which is also the shortest/cleanest of them all (although not the fastest).
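
If you want to reproduce the comparison in-process instead of timing whole script runs, a minimal timeit harness along these lines should work (just a sketch, assuming the four functions above are defined in the same module):

import timeit

for name in ('findall', 'sub', 'slug', 'urlenc'):
    # time 100000 calls of each function, imported from this script
    t = timeit.timeit(name + '()', setup='from __main__ import ' + name, number=100000)
    print('%s: %.3fs for 100000 calls' % (name, t))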

What I've chosen for now is Slugify, the lucky cat in between the bulldogs.

Martijn Pieters
1
import re
re.sub(r'!|\?|,', '', title)

This will remove !, ? and , from the string.
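
Building on that, combining it with a space replacement and lower() gives the output the question asks for:

import re

title = 'Hello World, Hello Cat! Hello?'
print(re.sub(r'!|\?|,', '', title).replace(' ', '_').lower())
# hello_world_hello_cat_hello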

jibreel
0

Sure, you can do this:

import string

uppers = string.ascii_uppercase # ABC...Z
lowers = string.ascii_lowercase # abc...z
removals = ''.join([ch for ch in string.punctuation if ch != '_']) # punctuation to delete, keeping '_'

# Python 3 str.maketrans: map uppercase to lowercase, space to underscore, delete punctuation
transtable = str.maketrans(uppers + " ", lowers + "_", removals)
title = "Hello World, Hello Cat! Hello?"
title.translate(transtable) # 'hello_world_hello_cat_hello'

You could also do a comprehension and ''.join it.

whitelist = string.ascii_uppercase + string.ascii_lowercase + " "

newtitle = ''.join('_' if ch == ' ' else ch.lower() for ch in title if ch in whitelist)
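
Both approaches give the expected result for the title above (assuming Python 3, where str.maketrans takes three arguments):

print(title.translate(transtable)) # hello_world_hello_cat_hello
print(newtitle)                    # hello_world_hello_cat_hello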
Adam Smith
0

I mean, you could split it up into multiple statements:

title = title.replace(' ', '_')
title = title.replace('!', '')
title = title.replace('?', '')
title = title.replace(',', '')
title = title.lower()

This will make for better readability.

heinst