regex needed to strip out domain name

Question

I need a regex to extract just the domain name part of a URL. For example, if I had the following URL:

http://www.website-2000.com

the bit I'd want the regex to match would be 'website-2000'

If you could also explain what each part of the regex does, to help me understand it, that would be great.

Thanks

Possible duplicate of [Domain name validation with RegEx](https://stackoverflow.com/questions/10306690/domain-name-validation-with-regex) — csilk, Nov 22 '17 at 00:22

score 11 · Accepted Answer · answered Jan 25 '11 at 09:40

This one should work. There might be some faults with it, but none that I can think of right now. If anyone wants to improve on it, feel free to do so.

/http:\/\/(?:www\.)?([a-z0-9\-]+)(?:\.[a-z\.]+[\/]?).*/i

http:\/\/            matches the "http://" part
(?:www\.)?           is a non-capturing group that matches zero or one "www."
([a-z0-9\-]+)        is a capturing group that matches character ranges a-z, 0-9
                     in addition to the hyphen. This is what you wanted to extract.
(?:\.[a-z\.]+[\/]?)  is a non-capturing group that matches the TLD part (i.e. ".com",
                     ".co.uk", etc) in addition to zero or one "/"
.*                   matches the rest of the url

http://rubular.com/r/ROz13NSWBQ
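
For anyone who wants to try the pattern above outside Rubular, here is a minimal Python sketch. It assumes the `/.../i` delimiters and flag translate to a raw string plus `re.IGNORECASE`; the helper name `extract_domain` is made up for illustration and is not part of the answer.

```python
import re

# The pattern from the answer, with the /.../i wrapper replaced by
# a raw string and the re.IGNORECASE flag (an assumed translation).
DOMAIN_RE = re.compile(
    r"http:\/\/(?:www\.)?([a-z0-9\-]+)(?:\.[a-z\.]+[\/]?).*",
    re.IGNORECASE,
)


def extract_domain(url):
    """Hypothetical helper: return capture group 1, or None if there is no match."""
    match = DOMAIN_RE.search(url)
    return match.group(1) if match else None


print(extract_domain("http://www.website-2000.com"))     # website-2000
print(extract_domain("http://website-2000.co.uk/page"))  # website-2000
print(extract_domain("https://www.website-2000.com"))    # None
```

The last call returns `None` because the pattern only accepts a literal `http://` prefix, so `https` URLs would need the scheme part loosened.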

The `.*` at the end is wrong; replace it with `[^ ]*`. As written, the regex also matches characters after the domain name. For example, in `http://www.website-2000.com jerry hates tom`, `jerry hates tom` will also be matched by the regex. This is outside the scope of the question, but it helps if you want to use the regex more broadly. — Anshit Chaudhary, Sep 28 '17 at 10:50
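
A quick way to see the difference this comment describes, sketched in Python (the variable names are illustrative only). Note that the change affects the overall match (group 0); capture group 1 is the same with either ending.

```python
import re

TEXT = "http://www.website-2000.com jerry hates tom"

# Original ending: .* runs to the end of the line, spaces included.
GREEDY = re.compile(
    r"http:\/\/(?:www\.)?([a-z0-9\-]+)(?:\.[a-z\.]+[\/]?).*", re.IGNORECASE
)
# Suggested ending: [^ ]* stops at the first space.
NO_SPACES = re.compile(
    r"http:\/\/(?:www\.)?([a-z0-9\-]+)(?:\.[a-z\.]+[\/]?)[^ ]*", re.IGNORECASE
)

greedy_match = GREEDY.search(TEXT)
bounded_match = NO_SPACES.search(TEXT)

print(greedy_match.group(0))   # http://www.website-2000.com jerry hates tom
print(bounded_match.group(0))  # http://www.website-2000.com
print(greedy_match.group(1), bounded_match.group(1))  # website-2000 website-2000
```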

score 4 · Answer 2 · edited Jul 28 '11 at 10:05

Let me introduce you to this wonderful tool, txt2re: a regular expression generator.

Here you can experiment with regex and generate code in many languages.

edited Jul 28 '11 at 10:05 by shanethehat · answered Jan 25 '11 at 09:38 by realbot

That saves me so much time! – Zach Saucier Dec 11 '13 at 21:35
the link is dead now – Anytoe Aug 24 '20 at 09:05

zeffdotorg · Answer 3 · 2017-11-22T01:11:39.877

score 0

r/^[^:]+:\/\/[^/?#]+//

This worked for me.

It will match any scheme or protocol, and then after the :// it matches any character that's not a /, ? or #. These three characters, when they first occur in a URL, signal the end of the domain, so that's where I end the match.
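
To make the behaviour described above concrete, here is a small Python sketch. It assumes only the pattern body `^[^:]+:\/\/[^/?#]+` (without the surrounding delimiters shown in the answer). As written, the whole match includes the scheme, so the second pattern adds a capture group around the host part; that capture group and both helper names are my own additions, not part of the answer.

```python
import re

# Pattern body from the answer: scheme, "://", then everything up to / ? or #.
ORIGIN_RE = re.compile(r"^[^:]+:\/\/[^/?#]+")
# Assumption: same pattern with a capture group added around the host part.
HOST_RE = re.compile(r"^[^:]+:\/\/([^/?#]+)")


def match_origin(url):
    """Hypothetical helper: the full match (scheme plus host), or None."""
    m = ORIGIN_RE.match(url)
    return m.group(0) if m else None


def match_host(url):
    """Hypothetical helper: only the part after '://', or None."""
    m = HOST_RE.match(url)
    return m.group(1) if m else None


print(match_origin("https://www.example.com/path?q=1"))  # https://www.example.com
print(match_host("ftp://files.example.com#top"))         # files.example.com
print(match_origin("www.example.com"))                   # None (no scheme)
```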

edited Nov 22 '17 at 01:11 · answered Nov 22 '17 at 00:07 by zeffdotorg

Kimvais · Answer 4 · 2011-01-25T09:48:00.173

score 0

http://wwww.([^/]+)

~~No need to use regexp, use the urlparse module~~

>>> from urllib.parse import urlparse  # Python 3; on Python 2 this was "from urlparse import urlparse"
>>> '.'.join(urlparse("http://www.website-2000.com").netloc.split('.')[-2:])
'website-2000.com'

edited Jan 25 '11 at 09:48 · answered Jan 25 '11 at 09:32 by Kimvais

sorry I need to do it with regex – geoffs3310 Jan 25 '11 at 09:33
Oh, stupid me, didn't notice that this wasn't a python question – Kimvais Jan 25 '11 at 09:37
Well, that's certainly a bit simpler than my behemoth. – hlindset Jan 25 '11 at 09:53
The expression: `http://wwww.([^/]+)` does not work for: `http://example.com` or `http://www.example.com?qvar=qval`. – ridgerunner Oct 10 '11 at 15:01
because there are 4 Ws in this regex – Khan Shahrukh May 27 '17 at 18:41

score 0 · Answer 5 · edited Nov 24 '11 at 20:52

This one means you don't have to worry about the scheme in front (http, https, ftp, etc.), and it also captures any subdomains.

/(?:www\.)?([a-z0-9\-.]+)(?:\.[a-z\.]+[\/]?).*/i

The only cases I've found where it fails are:
- If a . precedes the domain/subdomain without any text before it, the . is included in the regex capture.
- Emails with . in them will not work. (Fix this by checking the passed domain for the @ symbol before running it through the regex.)
- Whitespace in the middle of the domain/subdomain.
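
Here is a rough Python sketch of this scheme-less pattern, including the leading-dot failure case listed above. It assumes the trailing `/i` maps to `re.IGNORECASE`; the helper name `extract_host_labels` is hypothetical.

```python
import re

# Scheme-less pattern from this answer; the capture class also allows ".",
# so subdomain labels end up inside group 1.
ANY_SCHEME_RE = re.compile(
    r"(?:www\.)?([a-z0-9\-.]+)(?:\.[a-z\.]+[\/]?).*",
    re.IGNORECASE,
)


def extract_host_labels(text):
    """Hypothetical helper: return capture group 1, or None if nothing matches."""
    m = ANY_SCHEME_RE.search(text)
    return m.group(1) if m else None


print(extract_host_labels("http://www.website-2000.com"))  # website-2000
print(extract_host_labels("ftp://files.example.com/pub"))  # files.example
print(extract_host_labels(".example.com"))                 # .example (leading dot kept)
```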
