There is quite a bit to creating a canonical url apparently.
The url-normalize library is best that I have tested.
Depending on the source of your urls you may wish to clean them of other standard parameters such as UTM codes. w3lib.url.url_query_cleaner is useful for this.
Combining this with Ned Batchelder's answer could look something like:
Code:
from w3lib.url import url_query_cleaner
from url_normalize import url_normalize
urls = ['google.com',
'google.com/',
'http://google.com/',
'http://google.com',
'http://google.com?',
'http://google.com/?',
'http://google.com//',
'http://google.com?utm_source=Google']
def canonical_url(u):
u = url_normalize(u)
u = url_query_cleaner(u,parameterlist = ['utm_source','utm_medium','utm_campaign','utm_term','utm_content'],remove=True)
if u.startswith("http://"):
u = u[7:]
if u.startswith("https://"):
u = u[8:]
if u.startswith("www."):
u = u[4:]
if u.endswith("/"):
u = u[:-1]
return u
list(map(canonical_url,urls))
Result:
['google.com',
'google.com',
'google.com',
'google.com',
'google.com',
'google.com',
'google.com',
'google.com']