3

I need a regexp to extract just the domain name part of a URL. For example, if I had the following URL:

http://www.website-2000.com

the bit I'd want the regex to match would be 'website-2000'

If you could also explain what each part of the regex does to help me understand it, that would be great.

Thanks

geoffs3310
  • Possible duplicate of [Domain name validation with RegEx](https://stackoverflow.com/questions/10306690/domain-name-validation-with-regex) – csilk Nov 22 '17 at 00:22

5 Answers

11

This one should work. There might be some faults with it, but none that I can think of right now. If anyone wants to improve on it, feel free to do so.

/http:\/\/(?:www\.)?([a-z0-9\-]+)(?:\.[a-z\.]+[\/]?).*/i

http:\/\/            matches the "http://" part
(?:www\.)?           is a non-capturing group that matches zero or one "www."
([a-z0-9\-]+)        is a capturing group that matches one or more characters from
                     a-z, 0-9, or the hyphen. This is what you wanted to extract.
(?:\.[a-z\.]+[\/]?)  is a non-capturing group that matches the TLD part (e.g. ".com",
                     ".co.uk", etc.) plus zero or one "/"
.*                   matches the rest of the URL
/i                   makes the whole match case-insensitive

http://rubular.com/r/ROz13NSWBQ
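
For anyone who wants to try the pattern above outside Rubular, here is a minimal Python sketch. The function name `extract_domain` is made up for this example, and the `/i` flag is expressed as `re.IGNORECASE`; the pattern itself is copied unchanged.

```python
import re

# The pattern from this answer; the /i flag becomes re.IGNORECASE.
# Escaping "/" is unnecessary in Python, but it is harmless.
DOMAIN_RE = re.compile(
    r"http:\/\/(?:www\.)?([a-z0-9\-]+)(?:\.[a-z\.]+[\/]?).*",
    re.IGNORECASE,
)

def extract_domain(url):
    """Return the first capturing group, or None if the URL does not match."""
    match = DOMAIN_RE.match(url)
    return match.group(1) if match else None

print(extract_domain("http://www.website-2000.com"))     # website-2000
print(extract_domain("http://website-2000.co.uk/page"))  # website-2000
print(extract_domain("https://www.website-2000.com"))    # None (only http:// is handled)
```

Note that for a host with a subdomain other than "www", such as `http://blog.example.com`, the captured group is the first label (`blog`), not `example`.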

hlindset
  • The `.*` at the end is wrong because it also matches everything after the URL; replace it with `[^ ]*`. For example, in `http://www.website-2000.com jerry hates tom`, `jerry hates tom` is also matched by the regex. This is outside the scope of the question, but it makes the regex more broadly usable. – Anshit Chaudhary Sep 28 '17 at 10:50
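
The difference described in this comment can be checked with a short Python sketch. The variable names are illustrative only; the two patterns are the answer's regex with `.*` and with the suggested `[^ ]*` ending.

```python
import re

text = "http://www.website-2000.com jerry hates tom"

# Original ending: ".*" runs to the end of the line.
greedy = re.compile(
    r"http:\/\/(?:www\.)?([a-z0-9\-]+)(?:\.[a-z\.]+[\/]?).*", re.IGNORECASE
)
# Suggested ending: "[^ ]*" stops at the first space.
bounded = re.compile(
    r"http:\/\/(?:www\.)?([a-z0-9\-]+)(?:\.[a-z\.]+[\/]?)[^ ]*", re.IGNORECASE
)

whole_greedy = greedy.search(text).group(0)
whole_bounded = bounded.search(text).group(0)

print(whole_greedy)   # http://www.website-2000.com jerry hates tom
print(whole_bounded)  # http://www.website-2000.com
```

In both cases the capturing group itself is still `website-2000`; only the overall match changes.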
4

Let me introduce you to this wonderful tool, txt2re, a regular expression generator.

Here you can experiment with regex and generate code in many languages.

shanethehat
realbot
0
r/^[^:]+:\/\/[^/?#]+//

This worked for me.

It matches any scheme or protocol and then, after the ://, matches any run of characters that are not /, ? or #. These three characters, when they first occur in a URL, signal the end of the domain, so that's where I end the match.
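
A possible Python adaptation of this idea is sketched below. The capturing group around the host part and the name `extract_host` are additions for this example, not part of the expression above.

```python
import re

# Scheme, then "://", then everything up to the first "/", "?" or "#".
# The parentheses around the host part are an addition for this example.
HOST_RE = re.compile(r"^[^:]+://([^/?#]+)")

def extract_host(url):
    """Return the host part of the URL, or None if there is no scheme."""
    match = HOST_RE.match(url)
    return match.group(1) if match else None

print(extract_host("https://www.website-2000.com/path?x=1"))  # www.website-2000.com
print(extract_host("ftp://example.org#frag"))                 # example.org
```

This returns the full host, including "www." and the TLD, so getting just `website-2000` would still need a further split.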

0
http://www\.([^/]+)

No need to use a regexp; use the urlparse module (urllib.parse in Python 3):

>>> from urllib.parse import urlparse  # Python 2: from urlparse import urlparse
>>> '.'.join(urlparse("http://www.website-2000.com").netloc.split('.')[-2:])
'website-2000.com'
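
To get only `website-2000`, as the question asks, the same approach can be taken one step further. This is a sketch under a loud assumption: the public suffix is a single label such as "com". The helper name `domain_label` is hypothetical.

```python
from urllib.parse import urlparse

def domain_label(url):
    """Return the label just before the last one in the host name."""
    # .hostname is lower-cased and excludes any port or credentials.
    host = urlparse(url).hostname or ""
    labels = host.split(".")
    # Assumption: the suffix is a single label such as "com".
    return labels[-2] if len(labels) >= 2 else host

print(domain_label("http://www.website-2000.com"))       # website-2000
print(domain_label("http://www.website-2000.com:8080"))  # website-2000
```

With a multi-label suffix such as ".co.uk" this returns `co`, so handling those correctly would need a list of public suffixes.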

Kimvais
0

This one means you don't have to worry about the scheme (http, https, ftp, etc.) in front, and it also captures all your subdomains.

/(?:www\.)?([a-z0-9\-.]+)(?:\.[a-z\.]+[\/]?).*/i

The only cases I've found where it fails are:

- If a . precedes the domain/subdomain without any text before it, the . is included in the regex capture.
- Emails with . in them will not work (fix this by checking the passed domain for the @ symbol before running it through the regex).
- Whitespace in the middle of the domain/subdomain.
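
A quick way to see what this pattern captures is the Python sketch below. The function name `extract_with_subdomains` is made up for this example; `re.search` is used because the pattern is not anchored to a scheme.

```python
import re

# The pattern from this answer; the /i flag becomes re.IGNORECASE.
SUBDOMAIN_RE = re.compile(
    r"(?:www\.)?([a-z0-9\-.]+)(?:\.[a-z\.]+[\/]?).*",
    re.IGNORECASE,
)

def extract_with_subdomains(text):
    """Return the captured name (subdomains included), or None."""
    # search() scans forward, so the scheme in front is skipped.
    match = SUBDOMAIN_RE.search(text)
    return match.group(1) if match else None

print(extract_with_subdomains("http://www.website-2000.com"))   # website-2000
print(extract_with_subdomains("ftp://sub.example.com/files"))   # sub.example
print(extract_with_subdomains("www.website-2000.com"))          # website-2000
```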

animuson
bradbyu