872

How can I check if a given string is a valid URL address?

My knowledge of regular expressions is basic and doesn't allow me to choose from the hundreds of regular expressions I've already seen on the web.

iYoung
Vitor Silva
  • 38
    Any URL or just HTTP? E.g. does mailto:me@example.com count as a URL? An AIM chat link? – Mecki Oct 02 '08 at 11:01
  • this is related to http://stackoverflow.com/questions/82398/how-to-match-uris-in-text#83378 – jamesh Oct 02 '08 at 11:39
  • 5
    If a URL has no leading "http(etc)", how would you be able to distinguish it from any other arbitrary string that happens to have dots in it? Say something like "MyClass.MyProperty.MyMethod"? Or "I somtimes miss the spacebar.is this a problem?" – Tomalak May 07 '09 at 08:51
  • 1
    i've already prefixed 'http:/ /www.' before the textbox. so the user doesn't need to enter 'http:/ /www.' and should just be concerned with entering the required uri name. – input May 07 '09 at 09:07
  • So you don't want to *find* the URL in a text, but you want to *validate* user input? That's an important distinction. You should add that to the question. – Tomalak May 07 '09 at 09:10
  • alright, done. edited the question. – input May 07 '09 at 09:21
  • 2
    What programming language are you using? You probably don't want to reinvent the wheel. – a paid nerd May 11 '09 at 05:49
  • 11
    Microsoft has a Regex page that includes an expression for URLs. Surely a good start: http://msdn.microsoft.com/en-us/library/ff650303.aspx NB. The above page is retired, but the expressions in the table are essentially still valid for reference. The URL expression recommended (and which worked great for me) is: "^(ht|f)tp(s?)\:\/\/[0-9a-zA-Z]([-.\w]*[0-9a-zA-Z])*(:(0-9)*)*(\/?)([a-zA-Z0-9\-\.\?\,\'\/\\\+&%\$#_]*)?$" – CMH Feb 01 '12 at 23:39
  • 1
    With this expression, `"^(ht|f)tp(s?)\:\/\/[0-9a-zA-Z]([-.\w]*[0-9a-zA-Z])*(:(0-9)*)*(\/?)([a-zA-Z0-9\-\.\?\,\'\/\\\+&%\$#_]*)?$"`, the URL pattern is not valid. – Harmeet Singh Taara Jun 14 '13 at 09:45
  • It should be noted that without the leading 'http://' the string you're hoping to validate cannot be called a URL. URLs must be in the form: "scheme://domain:port/path?query_string#fragment_id" where the port, query_string and fragment_id (and their leading delimiters) are optional. – Rob Raisch Aug 21 '13 at 19:04
  • 1
    As [Andy Lester said](http://stackoverflow.com/a/18724720/1269037), use a specialized URL detection library that has been tested, tested, and tested again. – Dan Dascalescu Feb 21 '14 at 02:54
  • To find URLs within a string: https://mathiasbynens.be/demo/url-regex – Martin Thoma Aug 17 '17 at 09:36
  • I just found this and am wondering: why is it required to express this problem in a much less readable way like regular expressions instead of simply coding it out in easily readable code? – Ravior May 02 '19 at 12:44
  • @MartinThoma That link says "I have no interest in parsing a list of URLs from a given string of text" – Michael Mrozek Sep 17 '19 at 21:11
  • @MichaelMrozek and it continues with "even though some of the regexes on this page are capable of doing that" – Martin Thoma Sep 17 '19 at 21:12
  • @MartinThoma And then doesn't specify which ones. That seems like an odd link to provide for a use case that the page specifically says it's not covering – Michael Mrozek Sep 17 '19 at 21:36
  • I also tried to write a regex, and even thought I could take inspiration from the `InternalIsWellFormedOriginalString()` source code, but I gave up after reading how many different cases [dotnet](https://github.com/microsoft/referencesource/blob/5697c29004a34d80acdaf5742d7e699022c64ecd/System/net/System/UriExt.cs#L479) handles in its base and derived classes :) – gurkan May 29 '21 at 12:53
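
Several of the comments above point toward parsing rather than pattern-matching: a URL has the shape scheme://domain:port/path?query_string#fragment_id, and a string without a scheme cannot be told apart from ordinary dotted text. A minimal sketch of that idea in Python, using only the standard library's `urllib.parse.urlsplit`. The function name `looks_like_url` and the `ALLOWED_SCHEMES` set are assumptions made for illustration, not something taken from the question:

```python
from urllib.parse import urlsplit

# Hypothetical allow-list; adjust to the schemes your application accepts.
ALLOWED_SCHEMES = {"http", "https", "ftp"}

def looks_like_url(text: str) -> bool:
    """Return True if text parses as scheme://host... with an allowed scheme."""
    try:
        parts = urlsplit(text)
    except ValueError:
        # urlsplit raises ValueError for some malformed inputs,
        # such as unbalanced IPv6 brackets.
        return False
    # Require both a known scheme and a non-empty host.
    return parts.scheme in ALLOWED_SCHEMES and bool(parts.hostname)

# The individual components named in the comment are available as attributes.
parts = urlsplit("http://example.com:8080/path?q=1#frag")
print(parts.scheme, parts.hostname, parts.port, parts.path, parts.query, parts.fragment)

print(looks_like_url("http://example.com:8080/path?q=1#frag"))
print(looks_like_url("MyClass.MyProperty.MyMethod"))
```

This is deliberately lenient: it checks structure, not whether every character is legal under the RFCs, so it is a first filter rather than a full validator.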

56 Answers

428

I wrote my URL (actually IRI, internationalized) pattern to comply with RFC 3987 (http://www.faqs.org/rfcs/rfc3987.html). These are in PCRE syntax.

For absolute IRIs (internationalized):

/^[a-z](?:[-a-z0-9\+\.])*:(?:\/\/(?:(?:%[0-9a-f][0-9a-f]|[-a-z0-9\._~\x{A0}-\x{D7FF}\x{F900}-\x{FDCF}\x{FDF0}-\x{FFEF}\x{10000}-\x{1FFFD}\x{20000}-\x{2FFFD}\x{30000}-\x{3FFFD}\x{40000}-\x{4FFFD}\x{50000}-\x{5FFFD}\x{60000}-\x{6FFFD}\x{70000}-\x{7FFFD}\x{80000}-\x{8FFFD}\x{90000}-\x{9FFFD}\x{A0000}-\x{AFFFD}\x{B0000}-\x{BFFFD}\x{C0000}-\x{CFFFD}\x{D0000}-\x{DFFFD}\x{E1000}-\x{EFFFD}!\$&'\(\)\*\+,;=:])*@)?(?:\[(?:(?:(?:[0-9a-f]{1,4}:){6}(?:[0-9a-f]{1,4}:[0-9a-f]{1,4}|(?:[0-9]|[1-9][0-9]|1[0-9][0-9]|2[0-4][0-9]|25[0-5])(?:\.(?:[0-9]|[1-9][0-9]|1[0-9][0-9]|2[0-4][0-9]|25[0-5])){3})|::(?:[0-9a-f]{1,4}:){5}(?:[0-9a-f]{1,4}:[0-9a-f]{1,4}|(?:[0-9]|[1-9][0-9]|1[0-9][0-9]|2[0-4][0-9]|25[0-5])(?:\.(?:[0-9]|[1-9][0-9]|1[0-9][0-9]|2[0-4][0-9]|25[0-5])){3})|(?:[0-9a-f]{1,4})?::(?:[0-9a-f]{1,4}:){4}(?:[0-9a-f]{1,4}:[0-9a-f]{1,4}|(?:[0-9]|[1-9][0-9]|1[0-9][0-9]|2[0-4][0-9]|25[0-5])(?:\.(?:[0-9]|[1-9][0-9]|1[0-9][0-9]|2[0-4][0-9]|25[0-5])){3})|(?:(?:[0-9a-f]{1,4}:){0,1}[0-9a-f]{1,4})?::(?:[0-9a-f]{1,4}:){3}(?:[0-9a-f]{1,4}:[0-9a-f]{1,4}|(?:[0-9]|[1-9][0-9]|1[0-9][0-9]|2[0-4][0-9]|25[0-5])(?:\.(?:[0-9]|[1-9][0-9]|1[0-9][0-9]|2[0-4][0-9]|25[0-5])){3})|(?:(?:[0-9a-f]{1,4}:){0,2}[0-9a-f]{1,4})?::(?:[0-9a-f]{1,4}:){2}(?:[0-9a-f]{1,4}:[0-9a-f]{1,4}|(?:[0-9]|[1-9][0-9]|1[0-9][0-9]|2[0-4][0-9]|25[0-5])(?:\.(?:[0-9]|[1-9][0-9]|1[0-9][0-9]|2[0-4][0-9]|25[0-5])){3})|(?:(?:[0-9a-f]{1,4}:){0,3}[0-9a-f]{1,4})?::[0-9a-f]{1,4}:(?:[0-9a-f]{1,4}:[0-9a-f]{1,4}|(?:[0-9]|[1-9][0-9]|1[0-9][0-9]|2[0-4][0-9]|25[0-5])(?:\.(?:[0-9]|[1-9][0-9]|1[0-9][0-9]|2[0-4][0-9]|25[0-5])){3})|(?:(?:[0-9a-f]{1,4}:){0,4}[0-9a-f]{1,4})?::(?:[0-9a-f]{1,4}:[0-9a-f]{1,4}|(?:[0-9]|[1-9][0-9]|1[0-9][0-9]|2[0-4][0-9]|25[0-5])(?:\.(?:[0-9]|[1-9][0-9]|1[0-9][0-9]|2[0-4][0-9]|25[0-5])){3})|(?:(?:[0-9a-f]{1,4}:){0,5}[0-9a-f]{1,4})?::[0-9a-f]{1,4}|(?:(?:[0-9a-f]{1,4}:){0,6}[0-9a-f]{1,4})?::)|v[0-9a-f]+\.[-a-z0-9\._~!\$&'\(\)\*\+,;=:]+)\]|(?:[0-9]|[1-9][0-9]|1[0-9][0-9]|2[0-4][0-9]|25[0-5])(?:\.(?:[0-9]|[1-9][0-9]|1[0-9][0-9]|2[0-4][0-
9]|25[0-5])){3}|(?:%[0-9a-f][0-9a-f]|[-a-z0-9\._~\x{A0}-\x{D7FF}\x{F900}-\x{FDCF}\x{FDF0}-\x{FFEF}\x{10000}-\x{1FFFD}\x{20000}-\x{2FFFD}\x{30000}-\x{3FFFD}\x{40000}-\x{4FFFD}\x{50000}-\x{5FFFD}\x{60000}-\x{6FFFD}\x{70000}-\x{7FFFD}\x{80000}-\x{8FFFD}\x{90000}-\x{9FFFD}\x{A0000}-\x{AFFFD}\x{B0000}-\x{BFFFD}\x{C0000}-\x{CFFFD}\x{D0000}-\x{DFFFD}\x{E1000}-\x{EFFFD}!\$&'\(\)\*\+,;=])*)(?::[0-9]*)?(?:\/(?:(?:%[0-9a-f][0-9a-f]|[-a-z0-9\._~\x{A0}-\x{D7FF}\x{F900}-\x{FDCF}\x{FDF0}-\x{FFEF}\x{10000}-\x{1FFFD}\x{20000}-\x{2FFFD}\x{30000}-\x{3FFFD}\x{40000}-\x{4FFFD}\x{50000}-\x{5FFFD}\x{60000}-\x{6FFFD}\x{70000}-\x{7FFFD}\x{80000}-\x{8FFFD}\x{90000}-\x{9FFFD}\x{A0000}-\x{AFFFD}\x{B0000}-\x{BFFFD}\x{C0000}-\x{CFFFD}\x{D0000}-\x{DFFFD}\x{E1000}-\x{EFFFD}!\$&'\(\)\*\+,;=:@]))*)*|\/(?:(?:(?:(?:%[0-9a-f][0-9a-f]|[-a-z0-9\._~\x{A0}-\x{D7FF}\x{F900}-\x{FDCF}\x{FDF0}-\x{FFEF}\x{10000}-\x{1FFFD}\x{20000}-\x{2FFFD}\x{30000}-\x{3FFFD}\x{40000}-\x{4FFFD}\x{50000}-\x{5FFFD}\x{60000}-\x{6FFFD}\x{70000}-\x{7FFFD}\x{80000}-\x{8FFFD}\x{90000}-\x{9FFFD}\x{A0000}-\x{AFFFD}\x{B0000}-\x{BFFFD}\x{C0000}-\x{CFFFD}\x{D0000}-\x{DFFFD}\x{E1000}-\x{EFFFD}!\$&'\(\)\*\+,;=:@]))+)(?:\/(?:(?:%[0-9a-f][0-9a-f]|[-a-z0-9\._~\x{A0}-\x{D7FF}\x{F900}-\x{FDCF}\x{FDF0}-\x{FFEF}\x{10000}-\x{1FFFD}\x{20000}-\x{2FFFD}\x{30000}-\x{3FFFD}\x{40000}-\x{4FFFD}\x{50000}-\x{5FFFD}\x{60000}-\x{6FFFD}\x{70000}-\x{7FFFD}\x{80000}-\x{8FFFD}\x{90000}-\x{9FFFD}\x{A0000}-\x{AFFFD}\x{B0000}-\x{BFFFD}\x{C0000}-\x{CFFFD}\x{D0000}-\x{DFFFD}\x{E1000}-\x{EFFFD}!\$&'\(\)\*\+,;=:@]))*)*)?|(?:(?:(?:%[0-9a-f][0-9a-f]|[-a-z0-9\._~\x{A0}-\x{D7FF}\x{F900}-\x{FDCF}\x{FDF0}-\x{FFEF}\x{10000}-\x{1FFFD}\x{20000}-\x{2FFFD}\x{30000}-\x{3FFFD}\x{40000}-\x{4FFFD}\x{50000}-\x{5FFFD}\x{60000}-\x{6FFFD}\x{70000}-\x{7FFFD}\x{80000}-\x{8FFFD}\x{90000}-\x{9FFFD}\x{A0000}-\x{AFFFD}\x{B0000}-\x{BFFFD}\x{C0000}-\x{CFFFD}\x{D0000}-\x{DFFFD}\x{E1000}-\x{EFFFD}!\$&'\(\)\*\+,;=:@]))+)(?:\/(?:(?:%[0-9a-f][0-9a-f]|[-a-z0-9\._~\x{A0}-\x{D7FF}\x{F900}-\x{FDCF}\x{FDF0}
-\x{FFEF}\x{10000}-\x{1FFFD}\x{20000}-\x{2FFFD}\x{30000}-\x{3FFFD}\x{40000}-\x{4FFFD}\x{50000}-\x{5FFFD}\x{60000}-\x{6FFFD}\x{70000}-\x{7FFFD}\x{80000}-\x{8FFFD}\x{90000}-\x{9FFFD}\x{A0000}-\x{AFFFD}\x{B0000}-\x{BFFFD}\x{C0000}-\x{CFFFD}\x{D0000}-\x{DFFFD}\x{E1000}-\x{EFFFD}!\$&'\(\)\*\+,;=:@]))*)*|(?!(?:%[0-9a-f][0-9a-f]|[-a-z0-9\._~\x{A0}-\x{D7FF}\x{F900}-\x{FDCF}\x{FDF0}-\x{FFEF}\x{10000}-\x{1FFFD}\x{20000}-\x{2FFFD}\x{30000}-\x{3FFFD}\x{40000}-\x{4FFFD}\x{50000}-\x{5FFFD}\x{60000}-\x{6FFFD}\x{70000}-\x{7FFFD}\x{80000}-\x{8FFFD}\x{90000}-\x{9FFFD}\x{A0000}-\x{AFFFD}\x{B0000}-\x{BFFFD}\x{C0000}-\x{CFFFD}\x{D0000}-\x{DFFFD}\x{E1000}-\x{EFFFD}!\$&'\(\)\*\+,;=:@])))(?:\?(?:(?:%[0-9a-f][0-9a-f]|[-a-z0-9\._~\x{A0}-\x{D7FF}\x{F900}-\x{FDCF}\x{FDF0}-\x{FFEF}\x{10000}-\x{1FFFD}\x{20000}-\x{2FFFD}\x{30000}-\x{3FFFD}\x{40000}-\x{4FFFD}\x{50000}-\x{5FFFD}\x{60000}-\x{6FFFD}\x{70000}-\x{7FFFD}\x{80000}-\x{8FFFD}\x{90000}-\x{9FFFD}\x{A0000}-\x{AFFFD}\x{B0000}-\x{BFFFD}\x{C0000}-\x{CFFFD}\x{D0000}-\x{DFFFD}\x{E1000}-\x{EFFFD}!\$&'\(\)\*\+,;=:@])|[\x{E000}-\x{F8FF}\x{F0000}-\x{FFFFD}\x{100000}-\x{10FFFD}\/\?])*)?(?:\#(?:(?:%[0-9a-f][0-9a-f]|[-a-z0-9\._~\x{A0}-\x{D7FF}\x{F900}-\x{FDCF}\x{FDF0}-\x{FFEF}\x{10000}-\x{1FFFD}\x{20000}-\x{2FFFD}\x{30000}-\x{3FFFD}\x{40000}-\x{4FFFD}\x{50000}-\x{5FFFD}\x{60000}-\x{6FFFD}\x{70000}-\x{7FFFD}\x{80000}-\x{8FFFD}\x{90000}-\x{9FFFD}\x{A0000}-\x{AFFFD}\x{B0000}-\x{BFFFD}\x{C0000}-\x{CFFFD}\x{D0000}-\x{DFFFD}\x{E1000}-\x{EFFFD}!\$&'\(\)\*\+,;=:@])|[\/\?])*)?$/i

To also allow relative IRIs:

/^(?:[a-z](?:[-a-z0-9\+\.])*:(?:\/\/(?:(?:%[0-9a-f][0-9a-f]|[-a-z0-9\._~\x{A0}-\x{D7FF}\x{F900}-\x{FDCF}\x{FDF0}-\x{FFEF}\x{10000}-\x{1FFFD}\x{20000}-\x{2FFFD}\x{30000}-\x{3FFFD}\x{40000}-\x{4FFFD}\x{50000}-\x{5FFFD}\x{60000}-\x{6FFFD}\x{70000}-\x{7FFFD}\x{80000}-\x{8FFFD}\x{90000}-\x{9FFFD}\x{A0000}-\x{AFFFD}\x{B0000}-\x{BFFFD}\x{C0000}-\x{CFFFD}\x{D0000}-\x{DFFFD}\x{E1000}-\x{EFFFD}!\$&'\(\)\*\+,;=:])*@)?(?:\[(?:(?:(?:[0-9a-f]{1,4}:){6}(?:[0-9a-f]{1,4}:[0-9a-f]{1,4}|(?:[0-9]|[1-9][0-9]|1[0-9][0-9]|2[0-4][0-9]|25[0-5])(?:\.(?:[0-9]|[1-9][0-9]|1[0-9][0-9]|2[0-4][0-9]|25[0-5])){3})|::(?:[0-9a-f]{1,4}:){5}(?:[0-9a-f]{1,4}:[0-9a-f]{1,4}|(?:[0-9]|[1-9][0-9]|1[0-9][0-9]|2[0-4][0-9]|25[0-5])(?:\.(?:[0-9]|[1-9][0-9]|1[0-9][0-9]|2[0-4][0-9]|25[0-5])){3})|(?:[0-9a-f]{1,4})?::(?:[0-9a-f]{1,4}:){4}(?:[0-9a-f]{1,4}:[0-9a-f]{1,4}|(?:[0-9]|[1-9][0-9]|1[0-9][0-9]|2[0-4][0-9]|25[0-5])(?:\.(?:[0-9]|[1-9][0-9]|1[0-9][0-9]|2[0-4][0-9]|25[0-5])){3})|(?:(?:[0-9a-f]{1,4}:){0,1}[0-9a-f]{1,4})?::(?:[0-9a-f]{1,4}:){3}(?:[0-9a-f]{1,4}:[0-9a-f]{1,4}|(?:[0-9]|[1-9][0-9]|1[0-9][0-9]|2[0-4][0-9]|25[0-5])(?:\.(?:[0-9]|[1-9][0-9]|1[0-9][0-9]|2[0-4][0-9]|25[0-5])){3})|(?:(?:[0-9a-f]{1,4}:){0,2}[0-9a-f]{1,4})?::(?:[0-9a-f]{1,4}:){2}(?:[0-9a-f]{1,4}:[0-9a-f]{1,4}|(?:[0-9]|[1-9][0-9]|1[0-9][0-9]|2[0-4][0-9]|25[0-5])(?:\.(?:[0-9]|[1-9][0-9]|1[0-9][0-9]|2[0-4][0-9]|25[0-5])){3})|(?:(?:[0-9a-f]{1,4}:){0,3}[0-9a-f]{1,4})?::[0-9a-f]{1,4}:(?:[0-9a-f]{1,4}:[0-9a-f]{1,4}|(?:[0-9]|[1-9][0-9]|1[0-9][0-9]|2[0-4][0-9]|25[0-5])(?:\.(?:[0-9]|[1-9][0-9]|1[0-9][0-9]|2[0-4][0-9]|25[0-5])){3})|(?:(?:[0-9a-f]{1,4}:){0,4}[0-9a-f]{1,4})?::(?:[0-9a-f]{1,4}:[0-9a-f]{1,4}|(?:[0-9]|[1-9][0-9]|1[0-9][0-9]|2[0-4][0-9]|25[0-5])(?:\.(?:[0-9]|[1-9][0-9]|1[0-9][0-9]|2[0-4][0-9]|25[0-5])){3})|(?:(?:[0-9a-f]{1,4}:){0,5}[0-9a-f]{1,4})?::[0-9a-f]{1,4}|(?:(?:[0-9a-f]{1,4}:){0,6}[0-9a-f]{1,4})?::)|v[0-9a-f]+\.[-a-z0-9\._~!\$&'\(\)\*\+,;=:]+)\]|(?:[0-9]|[1-9][0-9]|1[0-9][0-9]|2[0-4][0-9]|25[0-5])(?:\.(?:[0-9]|[1-9][0-9]|1[0-9][0-9]|2[0-4]
[0-9]|25[0-5])){3}|(?:%[0-9a-f][0-9a-f]|[-a-z0-9\._~\x{A0}-\x{D7FF}\x{F900}-\x{FDCF}\x{FDF0}-\x{FFEF}\x{10000}-\x{1FFFD}\x{20000}-\x{2FFFD}\x{30000}-\x{3FFFD}\x{40000}-\x{4FFFD}\x{50000}-\x{5FFFD}\x{60000}-\x{6FFFD}\x{70000}-\x{7FFFD}\x{80000}-\x{8FFFD}\x{90000}-\x{9FFFD}\x{A0000}-\x{AFFFD}\x{B0000}-\x{BFFFD}\x{C0000}-\x{CFFFD}\x{D0000}-\x{DFFFD}\x{E1000}-\x{EFFFD}!\$&'\(\)\*\+,;=])*)(?::[0-9]*)?(?:\/(?:(?:%[0-9a-f][0-9a-f]|[-a-z0-9\._~\x{A0}-\x{D7FF}\x{F900}-\x{FDCF}\x{FDF0}-\x{FFEF}\x{10000}-\x{1FFFD}\x{20000}-\x{2FFFD}\x{30000}-\x{3FFFD}\x{40000}-\x{4FFFD}\x{50000}-\x{5FFFD}\x{60000}-\x{6FFFD}\x{70000}-\x{7FFFD}\x{80000}-\x{8FFFD}\x{90000}-\x{9FFFD}\x{A0000}-\x{AFFFD}\x{B0000}-\x{BFFFD}\x{C0000}-\x{CFFFD}\x{D0000}-\x{DFFFD}\x{E1000}-\x{EFFFD}!\$&'\(\)\*\+,;=:@]))*)*|\/(?:(?:(?:(?:%[0-9a-f][0-9a-f]|[-a-z0-9\._~\x{A0}-\x{D7FF}\x{F900}-\x{FDCF}\x{FDF0}-\x{FFEF}\x{10000}-\x{1FFFD}\x{20000}-\x{2FFFD}\x{30000}-\x{3FFFD}\x{40000}-\x{4FFFD}\x{50000}-\x{5FFFD}\x{60000}-\x{6FFFD}\x{70000}-\x{7FFFD}\x{80000}-\x{8FFFD}\x{90000}-\x{9FFFD}\x{A0000}-\x{AFFFD}\x{B0000}-\x{BFFFD}\x{C0000}-\x{CFFFD}\x{D0000}-\x{DFFFD}\x{E1000}-\x{EFFFD}!\$&'\(\)\*\+,;=:@]))+)(?:\/(?:(?:%[0-9a-f][0-9a-f]|[-a-z0-9\._~\x{A0}-\x{D7FF}\x{F900}-\x{FDCF}\x{FDF0}-\x{FFEF}\x{10000}-\x{1FFFD}\x{20000}-\x{2FFFD}\x{30000}-\x{3FFFD}\x{40000}-\x{4FFFD}\x{50000}-\x{5FFFD}\x{60000}-\x{6FFFD}\x{70000}-\x{7FFFD}\x{80000}-\x{8FFFD}\x{90000}-\x{9FFFD}\x{A0000}-\x{AFFFD}\x{B0000}-\x{BFFFD}\x{C0000}-\x{CFFFD}\x{D0000}-\x{DFFFD}\x{E1000}-\x{EFFFD}!\$&'\(\)\*\+,;=:@]))*)*)?|(?:(?:(?:%[0-9a-f][0-9a-f]|[-a-z0-9\._~\x{A0}-\x{D7FF}\x{F900}-\x{FDCF}\x{FDF0}-\x{FFEF}\x{10000}-\x{1FFFD}\x{20000}-\x{2FFFD}\x{30000}-\x{3FFFD}\x{40000}-\x{4FFFD}\x{50000}-\x{5FFFD}\x{60000}-\x{6FFFD}\x{70000}-\x{7FFFD}\x{80000}-\x{8FFFD}\x{90000}-\x{9FFFD}\x{A0000}-\x{AFFFD}\x{B0000}-\x{BFFFD}\x{C0000}-\x{CFFFD}\x{D0000}-\x{DFFFD}\x{E1000}-\x{EFFFD}!\$&'\(\)\*\+,;=:@]))+)(?:\/(?:(?:%[0-9a-f][0-9a-f]|[-a-z0-9\._~\x{A0}-\x{D7FF}\x{F900}-\x{FDCF}\x{FD
F0}-\x{FFEF}\x{10000}-\x{1FFFD}\x{20000}-\x{2FFFD}\x{30000}-\x{3FFFD}\x{40000}-\x{4FFFD}\x{50000}-\x{5FFFD}\x{60000}-\x{6FFFD}\x{70000}-\x{7FFFD}\x{80000}-\x{8FFFD}\x{90000}-\x{9FFFD}\x{A0000}-\x{AFFFD}\x{B0000}-\x{BFFFD}\x{C0000}-\x{CFFFD}\x{D0000}-\x{DFFFD}\x{E1000}-\x{EFFFD}!\$&'\(\)\*\+,;=:@]))*)*|(?!(?:%[0-9a-f][0-9a-f]|[-a-z0-9\._~\x{A0}-\x{D7FF}\x{F900}-\x{FDCF}\x{FDF0}-\x{FFEF}\x{10000}-\x{1FFFD}\x{20000}-\x{2FFFD}\x{30000}-\x{3FFFD}\x{40000}-\x{4FFFD}\x{50000}-\x{5FFFD}\x{60000}-\x{6FFFD}\x{70000}-\x{7FFFD}\x{80000}-\x{8FFFD}\x{90000}-\x{9FFFD}\x{A0000}-\x{AFFFD}\x{B0000}-\x{BFFFD}\x{C0000}-\x{CFFFD}\x{D0000}-\x{DFFFD}\x{E1000}-\x{EFFFD}!\$&'\(\)\*\+,;=:@])))(?:\?(?:(?:%[0-9a-f][0-9a-f]|[-a-z0-9\._~\x{A0}-\x{D7FF}\x{F900}-\x{FDCF}\x{FDF0}-\x{FFEF}\x{10000}-\x{1FFFD}\x{20000}-\x{2FFFD}\x{30000}-\x{3FFFD}\x{40000}-\x{4FFFD}\x{50000}-\x{5FFFD}\x{60000}-\x{6FFFD}\x{70000}-\x{7FFFD}\x{80000}-\x{8FFFD}\x{90000}-\x{9FFFD}\x{A0000}-\x{AFFFD}\x{B0000}-\x{BFFFD}\x{C0000}-\x{CFFFD}\x{D0000}-\x{DFFFD}\x{E1000}-\x{EFFFD}!\$&'\(\)\*\+,;=:@])|[\x{E000}-\x{F8FF}\x{F0000}-\x{FFFFD}\x{100000}-\x{10FFFD}\/\?])*)?(?:\#(?:(?:%[0-9a-f][0-9a-f]|[-a-z0-9\._~\x{A0}-\x{D7FF}\x{F900}-\x{FDCF}\x{FDF0}-\x{FFEF}\x{10000}-\x{1FFFD}\x{20000}-\x{2FFFD}\x{30000}-\x{3FFFD}\x{40000}-\x{4FFFD}\x{50000}-\x{5FFFD}\x{60000}-\x{6FFFD}\x{70000}-\x{7FFFD}\x{80000}-\x{8FFFD}\x{90000}-\x{9FFFD}\x{A0000}-\x{AFFFD}\x{B0000}-\x{BFFFD}\x{C0000}-\x{CFFFD}\x{D0000}-\x{DFFFD}\x{E1000}-\x{EFFFD}!\$&'\(\)\*\+,;=:@])|[\/\?])*)?|(?:\/\/(?:(?:%[0-9a-f][0-9a-f]|[-a-z0-9\._~\x{A0}-\x{D7FF}\x{F900}-\x{FDCF}\x{FDF0}-\x{FFEF}\x{10000}-\x{1FFFD}\x{20000}-\x{2FFFD}\x{30000}-\x{3FFFD}\x{40000}-\x{4FFFD}\x{50000}-\x{5FFFD}\x{60000}-\x{6FFFD}\x{70000}-\x{7FFFD}\x{80000}-\x{8FFFD}\x{90000}-\x{9FFFD}\x{A0000}-\x{AFFFD}\x{B0000}-\x{BFFFD}\x{C0000}-\x{CFFFD}\x{D0000}-\x{DFFFD}\x{E1000}-\x{EFFFD}!\$&'\(\)\*\+,;=:])*@)?(?:\[(?:(?:(?:[0-9a-f]{1,4}:){6}(?:[0-9a-f]{1,4}:[0-9a-f]{1,4}|(?:[0-9]|[1-9][0-9]|1[0-9][0-9]|2[0-4][0-9]|25[0
-5])(?:\.(?:[0-9]|[1-9][0-9]|1[0-9][0-9]|2[0-4][0-9]|25[0-5])){3})|::(?:[0-9a-f]{1,4}:){5}(?:[0-9a-f]{1,4}:[0-9a-f]{1,4}|(?:[0-9]|[1-9][0-9]|1[0-9][0-9]|2[0-4][0-9]|25[0-5])(?:\.(?:[0-9]|[1-9][0-9]|1[0-9][0-9]|2[0-4][0-9]|25[0-5])){3})|(?:[0-9a-f]{1,4})?::(?:[0-9a-f]{1,4}:){4}(?:[0-9a-f]{1,4}:[0-9a-f]{1,4}|(?:[0-9]|[1-9][0-9]|1[0-9][0-9]|2[0-4][0-9]|25[0-5])(?:\.(?:[0-9]|[1-9][0-9]|1[0-9][0-9]|2[0-4][0-9]|25[0-5])){3})|(?:(?:[0-9a-f]{1,4}:){0,1}[0-9a-f]{1,4})?::(?:[0-9a-f]{1,4}:){3}(?:[0-9a-f]{1,4}:[0-9a-f]{1,4}|(?:[0-9]|[1-9][0-9]|1[0-9][0-9]|2[0-4][0-9]|25[0-5])(?:\.(?:[0-9]|[1-9][0-9]|1[0-9][0-9]|2[0-4][0-9]|25[0-5])){3})|(?:(?:[0-9a-f]{1,4}:){0,2}[0-9a-f]{1,4})?::(?:[0-9a-f]{1,4}:){2}(?:[0-9a-f]{1,4}:[0-9a-f]{1,4}|(?:[0-9]|[1-9][0-9]|1[0-9][0-9]|2[0-4][0-9]|25[0-5])(?:\.(?:[0-9]|[1-9][0-9]|1[0-9][0-9]|2[0-4][0-9]|25[0-5])){3})|(?:(?:[0-9a-f]{1,4}:){0,3}[0-9a-f]{1,4})?::[0-9a-f]{1,4}:(?:[0-9a-f]{1,4}:[0-9a-f]{1,4}|(?:[0-9]|[1-9][0-9]|1[0-9][0-9]|2[0-4][0-9]|25[0-5])(?:\.(?:[0-9]|[1-9][0-9]|1[0-9][0-9]|2[0-4][0-9]|25[0-5])){3})|(?:(?:[0-9a-f]{1,4}:){0,4}[0-9a-f]{1,4})?::(?:[0-9a-f]{1,4}:[0-9a-f]{1,4}|(?:[0-9]|[1-9][0-9]|1[0-9][0-9]|2[0-4][0-9]|25[0-5])(?:\.(?:[0-9]|[1-9][0-9]|1[0-9][0-9]|2[0-4][0-9]|25[0-5])){3})|(?:(?:[0-9a-f]{1,4}:){0,5}[0-9a-f]{1,4})?::[0-9a-f]{1,4}|(?:(?:[0-9a-f]{1,4}:){0,6}[0-9a-f]{1,4})?::)|v[0-9a-f]+\.[-a-z0-9\._~!\$&'\(\)\*\+,;=:]+)\]|(?:[0-9]|[1-9][0-9]|1[0-9][0-9]|2[0-4][0-9]|25[0-5])(?:\.(?:[0-9]|[1-9][0-9]|1[0-9][0-9]|2[0-4][0-9]|25[0-5])){3}|(?:%[0-9a-f][0-9a-f]|[-a-z0-9\._~\x{A0}-\x{D7FF}\x{F900}-\x{FDCF}\x{FDF0}-\x{FFEF}\x{10000}-\x{1FFFD}\x{20000}-\x{2FFFD}\x{30000}-\x{3FFFD}\x{40000}-\x{4FFFD}\x{50000}-\x{5FFFD}\x{60000}-\x{6FFFD}\x{70000}-\x{7FFFD}\x{80000}-\x{8FFFD}\x{90000}-\x{9FFFD}\x{A0000}-\x{AFFFD}\x{B0000}-\x{BFFFD}\x{C0000}-\x{CFFFD}\x{D0000}-\x{DFFFD}\x{E1000}-\x{EFFFD}!\$&'\(\)\*\+,;=])*)(?::[0-9]*)?(?:\/(?:(?:%[0-9a-f][0-9a-f]|[-a-z0-9\._~\x{A0}-\x{D7FF}\x{F900}-\x{FDCF}\x{FDF0}-\x{FFEF}\x{10000}-\x{1FFFD}\x{20000}-\x{
2FFFD}\x{30000}-\x{3FFFD}\x{40000}-\x{4FFFD}\x{50000}-\x{5FFFD}\x{60000}-\x{6FFFD}\x{70000}-\x{7FFFD}\x{80000}-\x{8FFFD}\x{90000}-\x{9FFFD}\x{A0000}-\x{AFFFD}\x{B0000}-\x{BFFFD}\x{C0000}-\x{CFFFD}\x{D0000}-\x{DFFFD}\x{E1000}-\x{EFFFD}!\$&'\(\)\*\+,;=:@]))*)*|\/(?:(?:(?:(?:%[0-9a-f][0-9a-f]|[-a-z0-9\._~\x{A0}-\x{D7FF}\x{F900}-\x{FDCF}\x{FDF0}-\x{FFEF}\x{10000}-\x{1FFFD}\x{20000}-\x{2FFFD}\x{30000}-\x{3FFFD}\x{40000}-\x{4FFFD}\x{50000}-\x{5FFFD}\x{60000}-\x{6FFFD}\x{70000}-\x{7FFFD}\x{80000}-\x{8FFFD}\x{90000}-\x{9FFFD}\x{A0000}-\x{AFFFD}\x{B0000}-\x{BFFFD}\x{C0000}-\x{CFFFD}\x{D0000}-\x{DFFFD}\x{E1000}-\x{EFFFD}!\$&'\(\)\*\+,;=:@]))+)(?:\/(?:(?:%[0-9a-f][0-9a-f]|[-a-z0-9\._~\x{A0}-\x{D7FF}\x{F900}-\x{FDCF}\x{FDF0}-\x{FFEF}\x{10000}-\x{1FFFD}\x{20000}-\x{2FFFD}\x{30000}-\x{3FFFD}\x{40000}-\x{4FFFD}\x{50000}-\x{5FFFD}\x{60000}-\x{6FFFD}\x{70000}-\x{7FFFD}\x{80000}-\x{8FFFD}\x{90000}-\x{9FFFD}\x{A0000}-\x{AFFFD}\x{B0000}-\x{BFFFD}\x{C0000}-\x{CFFFD}\x{D0000}-\x{DFFFD}\x{E1000}-\x{EFFFD}!\$&'\(\)\*\+,;=:@]))*)*)?|(?:(?:%[0-9a-f][0-9a-f]|[-a-z0-9\._~\x{A0}-\x{D7FF}\x{F900}-\x{FDCF}\x{FDF0}-\x{FFEF}\x{10000}-\x{1FFFD}\x{20000}-\x{2FFFD}\x{30000}-\x{3FFFD}\x{40000}-\x{4FFFD}\x{50000}-\x{5FFFD}\x{60000}-\x{6FFFD}\x{70000}-\x{7FFFD}\x{80000}-\x{8FFFD}\x{90000}-\x{9FFFD}\x{A0000}-\x{AFFFD}\x{B0000}-\x{BFFFD}\x{C0000}-\x{CFFFD}\x{D0000}-\x{DFFFD}\x{E1000}-\x{EFFFD}!\$&'\(\)\*\+,;=@])+)(?:\/(?:(?:%[0-9a-f][0-9a-f]|[-a-z0-9\._~\x{A0}-\x{D7FF}\x{F900}-\x{FDCF}\x{FDF0}-\x{FFEF}\x{10000}-\x{1FFFD}\x{20000}-\x{2FFFD}\x{30000}-\x{3FFFD}\x{40000}-\x{4FFFD}\x{50000}-\x{5FFFD}\x{60000}-\x{6FFFD}\x{70000}-\x{7FFFD}\x{80000}-\x{8FFFD}\x{90000}-\x{9FFFD}\x{A0000}-\x{AFFFD}\x{B0000}-\x{BFFFD}\x{C0000}-\x{CFFFD}\x{D0000}-\x{DFFFD}\x{E1000}-\x{EFFFD}!\$&'\(\)\*\+,;=:@]))*)*|(?!(?:%[0-9a-f][0-9a-f]|[-a-z0-9\._~\x{A0}-\x{D7FF}\x{F900}-\x{FDCF}\x{FDF0}-\x{FFEF}\x{10000}-\x{1FFFD}\x{20000}-\x{2FFFD}\x{30000}-\x{3FFFD}\x{40000}-\x{4FFFD}\x{50000}-\x{5FFFD}\x{60000}-\x{6FFFD}\x{70000}-\x{7FFFD}\x{80
000}-\x{8FFFD}\x{90000}-\x{9FFFD}\x{A0000}-\x{AFFFD}\x{B0000}-\x{BFFFD}\x{C0000}-\x{CFFFD}\x{D0000}-\x{DFFFD}\x{E1000}-\x{EFFFD}!\$&'\(\)\*\+,;=:@])))(?:\?(?:(?:%[0-9a-f][0-9a-f]|[-a-z0-9\._~\x{A0}-\x{D7FF}\x{F900}-\x{FDCF}\x{FDF0}-\x{FFEF}\x{10000}-\x{1FFFD}\x{20000}-\x{2FFFD}\x{30000}-\x{3FFFD}\x{40000}-\x{4FFFD}\x{50000}-\x{5FFFD}\x{60000}-\x{6FFFD}\x{70000}-\x{7FFFD}\x{80000}-\x{8FFFD}\x{90000}-\x{9FFFD}\x{A0000}-\x{AFFFD}\x{B0000}-\x{BFFFD}\x{C0000}-\x{CFFFD}\x{D0000}-\x{DFFFD}\x{E1000}-\x{EFFFD}!\$&'\(\)\*\+,;=:@])|[\x{E000}-\x{F8FF}\x{F0000}-\x{FFFFD}\x{100000}-\x{10FFFD}\/\?])*)?(?:\#(?:(?:%[0-9a-f][0-9a-f]|[-a-z0-9\._~\x{A0}-\x{D7FF}\x{F900}-\x{FDCF}\x{FDF0}-\x{FFEF}\x{10000}-\x{1FFFD}\x{20000}-\x{2FFFD}\x{30000}-\x{3FFFD}\x{40000}-\x{4FFFD}\x{50000}-\x{5FFFD}\x{60000}-\x{6FFFD}\x{70000}-\x{7FFFD}\x{80000}-\x{8FFFD}\x{90000}-\x{9FFFD}\x{A0000}-\x{AFFFD}\x{B0000}-\x{BFFFD}\x{C0000}-\x{CFFFD}\x{D0000}-\x{DFFFD}\x{E1000}-\x{EFFFD}!\$&'\(\)\*\+,;=:@])|[\/\?])*)?)$/i

How they were compiled (in PHP):

<?php

/* Regex convenience functions (character class, non-capturing group) */
function cc($str, $suffix = '', $negate = false) {
    return '[' . ($negate ? '^' : '') . $str . ']' . $suffix;
}
function ncg($str, $suffix = '') {
    return '(?:' . $str . ')' . $suffix;
}

/* Preserved from RFC3986 */

$ALPHA = 'a-z';
$DIGIT = '0-9';
$HEXDIG = $DIGIT . 'a-f';

$sub_delims = '!\\$&\'\\(\\)\\*\\+,;=';
$gen_delims = ':\\/\\?\\#\\[\\]@';
$reserved = $gen_delims . $sub_delims;
$unreserved = '-' . $ALPHA . $DIGIT . '\\._~';

$pct_encoded = '%' . cc($HEXDIG) . cc($HEXDIG);

$dec_octet = ncg(implode('|', array(
    cc($DIGIT),
    cc('1-9') . cc($DIGIT),
    '1' . cc($DIGIT) . cc($DIGIT),
    '2' . cc('0-4') . cc($DIGIT),
    '25' . cc('0-5')
)));

$IPv4address = $dec_octet . ncg('\\.' . $dec_octet, '{3}');

$h16 = cc($HEXDIG, '{1,4}');
$ls32 = ncg($h16 . ':' . $h16 . '|' . $IPv4address);

$IPv6address = ncg(implode('|', array(
    ncg($h16 . ':', '{6}') . $ls32,
    '::' . ncg($h16 . ':', '{5}') . $ls32,
    ncg($h16, '?') . '::' . ncg($h16 . ':', '{4}') . $ls32,
    ncg(ncg($h16 . ':', '{0,1}') . $h16, '?') . '::' . ncg($h16 . ':', '{3}') . $ls32,
    ncg(ncg($h16 . ':', '{0,2}') . $h16, '?') . '::' . ncg($h16 . ':', '{2}') . $ls32,
    ncg(ncg($h16 . ':', '{0,3}') . $h16, '?') . '::' . $h16 . ':' . $ls32,
    ncg(ncg($h16 . ':', '{0,4}') . $h16, '?') . '::' . $ls32,
    ncg(ncg($h16 . ':', '{0,5}') . $h16, '?') . '::' . $h16,
    ncg(ncg($h16 . ':', '{0,6}') . $h16, '?') . '::',
)));

$IPvFuture = 'v' . cc($HEXDIG, '+') . '\\.' . cc($unreserved . $sub_delims . ':', '+');

$IP_literal = '\\[' . ncg(implode('|', array($IPv6address, $IPvFuture))) . '\\]';

$port = cc($DIGIT, '*');

$scheme = cc($ALPHA) . ncg(cc('-' . $ALPHA . $DIGIT . '\\+\\.'), '*');

/* New or changed in RFC3987 */

$iprivate = '\x{E000}-\x{F8FF}\x{F0000}-\x{FFFFD}\x{100000}-\x{10FFFD}';

$ucschar = '\x{A0}-\x{D7FF}\x{F900}-\x{FDCF}\x{FDF0}-\x{FFEF}' .
    '\x{10000}-\x{1FFFD}\x{20000}-\x{2FFFD}\x{30000}-\x{3FFFD}' .
    '\x{40000}-\x{4FFFD}\x{50000}-\x{5FFFD}\x{60000}-\x{6FFFD}' .
    '\x{70000}-\x{7FFFD}\x{80000}-\x{8FFFD}\x{90000}-\x{9FFFD}' .
    '\x{A0000}-\x{AFFFD}\x{B0000}-\x{BFFFD}\x{C0000}-\x{CFFFD}' .
    '\x{D0000}-\x{DFFFD}\x{E1000}-\x{EFFFD}';

$iunreserved = '-' . $ALPHA . $DIGIT . '\\._~' . $ucschar;

$ipchar = ncg($pct_encoded . '|' . cc($iunreserved . $sub_delims . ':@'));

$ifragment = ncg($ipchar . '|' . cc('\\/\\?'), '*');

$iquery = ncg($ipchar . '|' . cc($iprivate . '\\/\\?'), '*');

$isegment_nz_nc = ncg($pct_encoded . '|' . cc($iunreserved . $sub_delims . '@'), '+');
$isegment_nz = ncg($ipchar, '+');
$isegment = ncg($ipchar, '*');

$ipath_empty = '(?!' . $ipchar . ')';
$ipath_rootless = ncg($isegment_nz) . ncg('\\/' . $isegment, '*');
$ipath_noscheme = ncg($isegment_nz_nc) . ncg('\\/' . $isegment, '*');
$ipath_absolute = '\\/' . ncg($ipath_rootless, '?'); // Spec says isegment-nz *( "/" isegment )
$ipath_abempty = ncg('\\/' . $isegment, '*');

$ipath = ncg(implode('|', array(
    $ipath_abempty,
    $ipath_absolute,
    $ipath_noscheme,
    $ipath_rootless,
    $ipath_empty
)));

$ireg_name = ncg($pct_encoded . '|' . cc($iunreserved . $sub_delims), '*');

$ihost = ncg(implode('|', array($IP_literal, $IPv4address, $ireg_name)));
$iuserinfo = ncg($pct_encoded . '|' . cc($iunreserved . $sub_delims . ':'), '*');
$iauthority = ncg($iuserinfo . '@', '?') . $ihost . ncg(':' . $port, '?');

$irelative_part = ncg(implode('|', array(
    '\\/\\/' . $iauthority . $ipath_abempty . '',
    '' . $ipath_absolute . '',
    '' . $ipath_noscheme . '',
    '' . $ipath_empty . ''
)));

$irelative_ref = $irelative_part . ncg('\\?' . $iquery, '?') . ncg('\\#' . $ifragment, '?');

$ihier_part = ncg(implode('|', array(
    '\\/\\/' . $iauthority . $ipath_abempty . '',
    '' . $ipath_absolute . '',
    '' . $ipath_rootless . '',
    '' . $ipath_empty . ''
)));

$absolute_IRI = $scheme . ':' . $ihier_part . ncg('\\?' . $iquery, '?');

$IRI = $scheme . ':' . $ihier_part . ncg('\\?' . $iquery, '?') . ncg('\\#' . $ifragment, '?');

$IRI_reference = ncg($IRI . '|' . $irelative_ref);
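
The same compositional approach carries over to other languages. As a rough sketch (not the answer's own code), here is the IPv4 portion rebuilt in Python with equivalents of the `cc` and `ncg` helpers. The name `is_ipv4` is an assumption added for illustration:

```python
import re

def cc(chars: str, suffix: str = "") -> str:
    """Wrap chars in a character class, e.g. cc("0-9", "+") -> "[0-9]+"."""
    return "[" + chars + "]" + suffix

def ncg(pattern: str, suffix: str = "") -> str:
    """Wrap pattern in a non-capturing group, e.g. ncg("a|b", "?") -> "(?:a|b)?"."""
    return "(?:" + pattern + ")" + suffix

DIGIT = "0-9"

# dec-octet from RFC 3986: 0-9, 10-99, 100-199, 200-249, 250-255
dec_octet = ncg("|".join([
    cc(DIGIT),
    cc("1-9") + cc(DIGIT),
    "1" + cc(DIGIT) + cc(DIGIT),
    "2" + cc("0-4") + cc(DIGIT),
    "25" + cc("0-5"),
]))

# IPv4address = dec-octet "." dec-octet "." dec-octet "." dec-octet
ipv4_address = dec_octet + ncg(r"\." + dec_octet, "{3}")
IPV4_RE = re.compile(ipv4_address)

def is_ipv4(text: str) -> bool:
    """Hypothetical helper: True if the whole string is a dotted-quad IPv4 address."""
    return IPV4_RE.fullmatch(text) is not None

print(is_ipv4("192.168.0.1"))
print(is_ipv4("256.1.1.1"))
```

Building the pattern from named pieces keeps each rule checkable against the grammar in the RFC, which is hard to do with the flattened one-line result.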

Edit 7 March 2011: Because of the way PHP handles backslashes in quoted strings, these are unusable by default. You'll need to double-escape backslashes except where the backslash has a special meaning in regex. You can do that this way:

$escape_backslash = '/(?<!\\)\\(?![\[\]\\\^\$\.\|\*\+\(\)QEnrtaefvdwsDWSbAZzB1-9GX]|x\{[0-9a-f]{1,4}\}|\c[A-Z]|)/';
$absolute_IRI = preg_replace($escape_backslash, '\\\\', $absolute_IRI);
$IRI = preg_replace($escape_backslash, '\\\\', $IRI);
$IRI_reference = preg_replace($escape_backslash, '\\\\', $IRI_reference);
iTakeshi
eyelidlessness
  • 80
    If you think that's bad, you should see the one for e-mail: http://ex-parrot.com/~pdw/Mail-RFC822-Address.html – Peter Di Cecco Jan 06 '10 at 19:27
  • 12
    @Gumbo, it's allowed in the spec and used in URI implementations for HTTP applications. It's discouraged (for obvious reasons) but perfectly valid and should be anticipated. Most (if not all?) browsers sometimes translate HTTP authentication into the URL for subsequent access. – eyelidlessness Jul 08 '10 at 15:05
  • 12
    @Devin, in a function in what language? I compiled it in PHP, but it can be used in other languages. Should I write a function in all of those languages? Alternately, it would be pretty simple for you to do the same in a language of your choosing. – eyelidlessness Oct 17 '11 at 00:26
  • 2
    Perhaps I should post a question specifically on wrapping your code in functions in various languages? I think that would keep things organized. – Devin Rhode Oct 17 '11 at 01:10
  • 1
    Thanks for posting this answer - the only thing I'm finding is that in RegexBuddy it's not working; things like `\x{10000}-\x{1FFFD}` are causing trouble. Any ideas? – joshcomley Nov 22 '11 at 15:40
  • 6
    @joshcomley replace \x{ABCD} with \uABCD if you write it in JS – bruha Feb 13 '12 at 01:51
  • 4
    Yes, `http://com` is a valid URL. `http://localhost` is, why wouldn't other words be? You are correct that the `u` modifier is necessary in PHP. I want to be clear that while I generated these with PHP, they are not meant to be PHP-specific. – eyelidlessness Nov 22 '13 at 17:18
  • 2
    This answer has been added to the [Stack Overflow Regular Expression FAQ](http://stackoverflow.com/a/22944075/2736496), under "Common Validation Tasks". – aliteralmind Apr 10 '14 at 01:18
  • 1
    @eyelidlessness This regex erroneously allows `|` in querystrings. Eg `http://foo.com?a=|` matches. I think it's because of the stray `|` in `$iprivate` – Hans Jun 23 '15 at 22:19
  • @Hans, why wouldn't `|` be allowed in a query string? It is. It matches the `%xF0000-FFFFD` range of `iprivate`. The `|` in the `$iprivate` variable is a regex special character that means `OR`. See http://www.regular-expressions.info/refquick.html – eyelidlessness Jun 24 '15 at 04:42
  • 1
    @eyelidlessness Per [RFC 3987 IRI Syntax](https://tools.ietf.org/html/rfc3987#section-2), the Unicode character VERTICAL LINE (U+007C) is not allowed anywhere in IRIs at all, in fact. The `|` in `$iprivate` represents a literal, NOT an alternation operator, since it's enclosed in a character class. – Hans Jun 24 '15 at 13:07
  • @Hans, while I see now that the `|` is treated as a literal as you say, I see nothing in the spec disallowing the character anywhere, and it does in fact match the `iprivate` range `%xF0000-FFFFD`. The pipe is allowed, and there's no reason it shouldn't be. – eyelidlessness Jun 26 '15 at 05:36
  • 2
    @eyelidlessness Why do you think `u+007c` is in the range `u+F0000-u+FFFFD`? If you need further convincing, just test `/[\x{F0000}-\x{FFFFD}]/u` against `|` to observe that it does not match. If still not convinced, take a look at IRI validators across various languages such as [Python's rfc3987 package](https://pypi.python.org/pypi/rfc3987/) or [.NET's Uri.IsWellFormedUriString method with IRI support enabled](https://msdn.microsoft.com/en-us/library/system.uri.iswellformeduristring(v=vs.110).aspx). None of them allow for `|`. See sample results [here](http://i.imgur.com/elkjwVJ.png) – Hans Jun 26 '15 at 19:51
  • 2
    @Hans, I apologize, you are correct. I was very quickly trying to verify by converting the pattern to JS to test in my console, because I don't have a PHP environment to test in anymore. But I was not paying attention to converting the character classes correctly. I guess I was surprised because there's really no obvious reason that a pipe would be disallowed. Thanks for the correction. – eyelidlessness Jun 26 '15 at 21:58
  • 2
    @eyelidlessness No worries. Thanks for updating the answer. BTW, [regex101.com](https://regex101.com/) is an excellent tool for testing both pcre and js regex's. – Hans Jun 26 '15 at 22:25
  • I spent half a minute trying to select the first regex to copy it before I realized that I could triple-click on it. – GalaxyCat105 Sep 17 '20 at 21:14
  • can someone copy the regex here into a pastebin or something? For some reason I cannot copy the IRI regexes above and get a valid copy. – jaaq Oct 06 '20 at 09:36
  • 1
    @jaaq Try triple-clicking on the IRI regex to select it all. Here's a paste: https://pastebin.com/9i7FSQ23. And for relative URIs: https://pastebin.com/qyv6gmQe – mbomb007 Nov 11 '20 at 19:18
  • @mbomb007 wow thx, I never tried triple-clicking o.O also thanks for the paste :) – jaaq Nov 11 '20 at 21:31
  • For ireg-name, the spec says `ireg-name = *( iunreserved / pct-encoded / sub-delims )` but the code uses `$ireg_name = ncg($pct_encoded . '|' . cc($iunreserved . $sub_delims . '@'), '*');` The code adds `'@'`. Which is correct? I suspect the spec is correct. – AtesComp May 12 '21 at 15:28
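
The comments about `\x{10000}-\x{1FFFD}` failing in some tools come down to escape syntax: PCRE writes code points as `\x{HEX}`, while other engines use different forms. As a hedged sketch for Python's `re` module, which accepts `\UXXXXXXXX` in string patterns, the escapes can be rewritten mechanically. The helper name `pcre_hex_to_python` is an assumption for illustration, and it does not try to handle a literal backslash that happens to precede `x{`:

```python
import re

# Matches PCRE-style code point escapes such as \x{A0} or \x{10FFFD}.
_PCRE_HEX = re.compile(r"\\x\{([0-9A-Fa-f]{1,6})\}")

def pcre_hex_to_python(pattern: str) -> str:
    """Rewrite \\x{HEX} escapes as \\UXXXXXXXX, which Python's re understands."""
    return _PCRE_HEX.sub(lambda m: "\\U%08X" % int(m.group(1), 16), pattern)

# One of the iprivate ranges discussed in the comments.
private_use = pcre_hex_to_python(r"[\x{F0000}-\x{FFFFD}]")
print(private_use)

# The pipe character (U+007C) is not in that range, as Hans pointed out.
print(re.fullmatch(private_use, "|") is None)
```

The same substitution idea applies to the `\uABCD` form mentioned for JavaScript, with the caveat that code points above U+FFFF need that engine's own notation.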
162

I've just written a blog post about a solution for recognizing URLs in the most commonly used formats, such as:

  • www.google.com
  • http://www.google.com
  • mailto:somebody@google.com
  • somebody@google.com
  • www.url-with-querystring.com/?url=has-querystring

The regular expression used is:

/((([A-Za-z]{3,9}:(?:\/\/)?)(?:[-;:&=\+\$,\w]+@)?[A-Za-z0-9.-]+|(?:www.|[-;:&=\+\$,\w]+@)[A-Za-z0-9.-]+)((?:\/[\+~%\/.\w-_]*)?\??(?:[-\+=&;%@.\w_]*)#?(?:[\w]*))?)/
Community
Matthew O'Riordan
  • This is good, but fails in a few spots (as @Matthew ack's in his comments in his blog): beta.foobar.com and goo.gl and bit.ly – cmroanirgo Dec 01 '12 at 07:11
  • 23
    That one also works, but it's missing support for the port number (useful in debugging). Modified would be `/((([A-Za-z]{3,9}:(?:\/\/)?)(?:[-;:&=\+\$,\w]+@)?[A-Za-z0-9.-]+(:[0-9]+)?|(?:www.|[-;:&=\+\$,\w]+@)[A-Za-z0-9.-]+)((?:\/[\+~%\/.\w-_]*)?\??(?:[-\+=&;%@.\w_]*)#?(?:[\w]*))?)/` – Jaime Cham Mar 15 '13 at 08:58
  • Without http:// before it, the above does recognise the URL of this page. Depends on expected usage but strikes me as weak in many cases. – Oliver Moran May 05 '13 at 20:17
  • 2
    This regex doesn't handle links with parentheses in them, e.g. msdn.microsoft.com/en-us/library/ms563775(v=office.14).aspx – RobH Jul 10 '13 at 09:28
  • 4
    Shouldn't the dot be escaped after www? – Anthony Aug 08 '13 at 17:04
  • Perhaps they run their code against unit tests, and those unit tests contain strings they thought up which look like URLs but aren't? – machineghost Jan 31 '14 at 00:01
  • 16
    Got another match mate: `width:210px;` and `margin:3px` – Cas Bloem Feb 07 '14 at 15:29
  • another match: "dot:." the main search it is looking for is 3-9 characters, followed by a colon ":" then any character or number from a-z && 0-9 + others :) @whiteb0x if you are looking for a regex validator [check here](http://regex101.com/#PHP). – Steve P Mar 25 '14 at 10:10
  • Can this be easily modified to avoid matching a trailing dot, as in "www.test.com."? – Imran Apr 03 '15 at 20:53
  • 2
    Doesn't match "example.com"...? – Gustav Jun 27 '15 at 18:07
  • Big issue with this one as it also matches javascript:alert(0); – Chad Brown May 31 '16 at 21:45
  • It also match the email address pattern i.e abc@xyz.com – basit raza Dec 29 '16 at 06:40
  • This regex does not work for URLs that contain a comma For example: http://www.example.com/todo.do?id=123,456 – Salem Artin Jan 26 '18 at 22:21
  • Link broken fix please – Mr_and_Mrs_D Feb 23 '18 at 12:45
  • This regex also matches `string:string`. So it will match `test:P` or `test:DDd`. Should probably fix that.. – icecub Oct 15 '19 at 06:43
  • if i provide `omaraslam.com:6060` it only catches the `.com:POSRTNUMBER` – Reborn Aug 27 '20 at 09:25
  • try this URL, the end of "{zip}" is not calculated https://fakedomain.fake.ddd.com/?f_fs=tzxzlattxxxixzxqtb39&flux_cost=.040&sui=60_2_414_15_1&p=76242&e=pb@pb.com&fn=Ajeya&ln=Barua&z={zip} – developer learn999 Oct 21 '20 at 14:13
82

What platform? If using .NET, use System.Uri.TryCreate, not a regex.

For example:

static bool IsValidUrl(string urlString)
{
    Uri uri;
    return Uri.TryCreate(urlString, UriKind.Absolute, out uri)
        && (uri.Scheme == Uri.UriSchemeHttp
         || uri.Scheme == Uri.UriSchemeHttps
         || uri.Scheme == Uri.UriSchemeFtp
         || uri.Scheme == Uri.UriSchemeMailto
            /*...*/);
}

// In test fixture...

[Test]
void IsValidUrl_Test()
{
    Assert.True(IsValidUrl("http://www.example.com"));
    Assert.False(IsValidUrl("javascript:alert('xss')"));
    Assert.False(IsValidUrl(""));
    Assert.False(IsValidUrl(null));
}

(Thanks to @Yoshi for the tip about javascript:)

Community
  • 1
  • 1
Duncan Smart
  • 27,805
  • 8
  • 60
  • 69
  • how would you use system.uri to check for a valid url? – dev.e.loper Mar 31 '09 at 17:15
  • 7
    Uri.TryCreate() returns true if it's valid – Duncan Smart Apr 01 '09 at 09:03
  • 122
    A HUGE warning to anyone who uses this technique: System.Uri correctly accepts `javascript: alert('blah')`. You need to do further validation on [Uri.Scheme](http://msdn.microsoft.com/en-us/library/system.uri.scheme.aspx) to confirm the http/https/ftp protocol is being used, otherwise if such a URL is inserted into your ASP.NET pages' HTML as a link, **your users are vulnerable to XSS attacks**. – Yoshi Aug 10 '11 at 05:25
  • 23
    Notably, Uri.TryCreate returns true for empty strings as well. It appears that TryCreate isn't very effective... – Steven Evers May 09 '12 at 14:26
  • 1
    what if I need a regex to do server/client-side in an ASP.NET MVC app? How would this help me on the client? – Andrei Rînea May 30 '13 at 15:56
  • 4
    For .Net, use `Uri.IsWellFormedUriString()` – mheyman Aug 23 '15 at 18:12
  • -1 for a few reasons: 1. Doesn't answer the original question, at all. 2. Puts an undue amount of faith in a black box system that according to these comments doesn't even work remotely as well as the regex examples provided. – rw-nandemo May 16 '16 at 18:14
  • 1
    @rw-nandemo: I would argue a *mile-long* RegEx-String isn't less of a black box than an official and probably thoroughly real-world-tested .NET API. (Also see [Andy Lester's answer below](http://stackoverflow.com/a/18724720/2822719)) – Marcus Mangelsdorf Feb 05 '17 at 17:02
  • @DuncanSmart I like your approach. Also, it would be interesting to go even further and make it even simpler and language agnostic. If you only accept public URLs and only if they are available, it would be interesting to try to do an actual request to the URL and accept it if everything is fine; otherwise it is not valid. – Alin Ciocan Jul 07 '17 at 12:02
61

Here's what RegexBuddy uses.

(\b(https?|ftp|file)://)?[-A-Za-z0-9+&@#/%?=~_|!:,.;]+[-A-Za-z0-9+&@#/%=~_|]

It matches these below (inside the ** ** marks):

**http://www.regexbuddy.com**  
**http://www.regexbuddy.com/**  
**http://www.regexbuddy.com/index.html**  
**http://www.regexbuddy.com/index.html?source=library**  

You can download RegexBuddy at http://www.regexbuddy.com/download.html.

Keng
  • 48,571
  • 31
  • 77
  • 109
  • 31
    What about gopher? Poor, forgotten gopher. – toohool Oct 02 '08 at 18:00
  • 3
    Your regex doesn't match any url I can come up with - including those you've included. I paste your regex into http://www.rubular.com and it says "Forward slashes must be escaped." Is there a typo or can you clarify by getting it to work at rubular.com? – PandaWood Nov 13 '10 at 07:18
  • 3
    @PandaWood that's because you need to format for Ruby. What is Ruby's escape character? – Keng Nov 15 '10 at 14:39
  • Hi Keng, even if I copy your exact RegEx above into RegexBuddy, I can't match it on any URL. I guess there's something gone amiss in the markup. Ruby regex is hardly any different at this basic syntax level. – PandaWood Nov 22 '10 at 01:28
  • @PandaWood wait...if you have REB just go to the library and grab it. that's where i got it...check to see if they are the same. – Keng Nov 22 '10 at 04:26
  • 19
    As a JavaScript RegExp literal: `/\b(https?|ftp|file):\/\/[\-A-Za-z0-9+&@#\/%?=~_|!:,.;]*[\-A-Za-z0-9+&@#\/%=~_|]/` – jpillora Jan 16 '13 at 00:16
  • @Mahesh Chand thanks for the update; the edit got rejected so I couldnt get a moderator to reinstate the enhancement. I think it got rejected because the reviewers would rather see code changes in comments and then let the OP add it. I made the update though. thanks. – Keng Mar 21 '14 at 18:26
  • The good thing about this regex is that it matches URLs with commas. For example: http://www.example.com/todo.do?id=123,456 – Salem Artin Jan 26 '18 at 22:22
  • This regex not only does not match many valid URIs, but also matches anything like `[-A-Za-z0-9+&@#/%?=~_|!:,.;]`, which is of course nothing like a URI. I suggest deletion. – Michael Foukarakis May 30 '18 at 08:30
  • 3
    This matches nearly everything... useless – Teejay Sep 17 '18 at 10:42
  • 1
    @toohool, at least gopher lasted longer than archie. :-\ (Do people still finger? ) – Synetech Sep 09 '19 at 17:21
48

With regard to eyelidness' answer post that reads "This is based on my reading of the URI specification.": Thanks Eyelidness, yours is the perfect solution I sought, as it is based on the URI spec! Superb work. :)

I had to make two amendments. The first was to get the regexp to match IP address URLs correctly in PHP (v5.2.10) with the preg_match() function.

I had to add one more set of parentheses to the line above "IP Address" around the pipes:

)|((\d|[1-9]\d|1\d{2}|2[0-4][0-9]|25[0-5])\.){3}(?#

Not sure why.
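A possible explanation, offered as a guess rather than a diagnosis of the original pattern: alternation (`|`) binds more loosely than anything else in a regex, so the octet alternatives have to be wrapped in their own group before the trailing `\.` and the `{3}` repeat can apply to the whole octet. The sketch below uses Python's `re` module purely for illustration; the names `OCTET`, `IPV4` and `is_ipv4` are made up for this example.

```python
import re

# One octet, 0-255, copied from the IP address part of the regex above.
# The outer parentheses keep the alternatives together as a single unit.
OCTET = r"(\d|[1-9]\d|1\d{2}|2[0-4][0-9]|25[0-5])"

# Three "octet plus dot" groups followed by a final octet.
IPV4 = re.compile(r"(" + OCTET + r"\.){3}" + OCTET)


def is_ipv4(text):
    """Return True if the whole string is a dotted-quad IPv4 address."""
    return IPV4.fullmatch(text) is not None


print(is_ipv4("192.168.1.120"))
print(is_ipv4("256.1.1.1"))
```

Without the group around the alternatives, the `\.` would attach only to the last alternative (`25[0-5]`), which changes what the pattern accepts.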

I have also reduced the top level domain minimum length from 3 to 2 letters to support .co.uk and similar.

Final code:

/^(https?|ftp):\/\/(?#                                      protocol
)(([a-z0-9$_\.\+!\*\'\(\),;\?&=-]|%[0-9a-f]{2})+(?#         username
)(:([a-z0-9$_\.\+!\*\'\(\),;\?&=-]|%[0-9a-f]{2})+)?(?#      password
)@)?(?#                                                     auth requires @
)((([a-z0-9]\.|[a-z0-9][a-z0-9-]*[a-z0-9]\.)*(?#             domain segments AND
)[a-z][a-z0-9-]*[a-z0-9](?#                                 top level domain  OR
)|((\d|[1-9]\d|1\d{2}|2[0-4][0-9]|25[0-5])\.){3}(?#
    )(\d|[1-9]\d|1\d{2}|2[0-4][0-9]|25[0-5])(?#             IP address
))(:\d+)?(?#                                                port
))(((\/+([a-z0-9$_\.\+!\*\'\(\),;:@&=-]|%[0-9a-f]{2})*)*(?# path
)(\?([a-z0-9$_\.\+!\*\'\(\),;:@&=-]|%[0-9a-f]{2})*)(?#      query string
)?)?)?(?#                                                   path and query string optional
)(#([a-z0-9$_\.\+!\*\'\(\),;:@&=-]|%[0-9a-f]{2})*)?(?#      fragment
)$/i

This modified version was not checked against the URI specification, so I can't vouch for its compliance. It was altered to handle URLs on local network environments and two-letter TLDs as well as other kinds of Web URL, and to work better in the PHP setup I use.

As PHP code:

define('URL_FORMAT', 
'/^(https?):\/\/'.                                         // protocol
'(([a-z0-9$_\.\+!\*\'\(\),;\?&=-]|%[0-9a-f]{2})+'.         // username
'(:([a-z0-9$_\.\+!\*\'\(\),;\?&=-]|%[0-9a-f]{2})+)?'.      // password
'@)?(?#'.                                                  // auth requires @
')((([a-z0-9]\.|[a-z0-9][a-z0-9-]*[a-z0-9]\.)*'.                      // domain segments AND
'[a-z][a-z0-9-]*[a-z0-9]'.                                 // top level domain  OR
'|((\d|[1-9]\d|1\d{2}|2[0-4][0-9]|25[0-5])\.){3}'.
'(\d|[1-9]\d|1\d{2}|2[0-4][0-9]|25[0-5])'.                 // IP address
')(:\d+)?'.                                                // port
')(((\/+([a-z0-9$_\.\+!\*\'\(\),;:@&=-]|%[0-9a-f]{2})*)*'. // path
'(\?([a-z0-9$_\.\+!\*\'\(\),;:@&=-]|%[0-9a-f]{2})*)'.      // query string
'?)?)?'.                                                   // path and query string optional
'(#([a-z0-9$_\.\+!\*\'\(\),;:@&=-]|%[0-9a-f]{2})*)?'.      // fragment
'$/i');

Here is a test program in PHP which validates a variety of URLs using the regex:

<?php

define('URL_FORMAT',
'/^(https?):\/\/'.                                         // protocol
'(([a-z0-9$_\.\+!\*\'\(\),;\?&=-]|%[0-9a-f]{2})+'.         // username
'(:([a-z0-9$_\.\+!\*\'\(\),;\?&=-]|%[0-9a-f]{2})+)?'.      // password
'@)?(?#'.                                                  // auth requires @
')((([a-z0-9]\.|[a-z0-9][a-z0-9-]*[a-z0-9]\.)*'.                      // domain segments AND
'[a-z][a-z0-9-]*[a-z0-9]'.                                 // top level domain  OR
'|((\d|[1-9]\d|1\d{2}|2[0-4][0-9]|25[0-5])\.){3}'.
'(\d|[1-9]\d|1\d{2}|2[0-4][0-9]|25[0-5])'.                 // IP address
')(:\d+)?'.                                                // port
')(((\/+([a-z0-9$_\.\+!\*\'\(\),;:@&=-]|%[0-9a-f]{2})*)*'. // path
'(\?([a-z0-9$_\.\+!\*\'\(\),;:@&=-]|%[0-9a-f]{2})*)'.      // query string
'?)?)?'.                                                   // path and query string optional
'(#([a-z0-9$_\.\+!\*\'\(\),;:@&=-]|%[0-9a-f]{2})*)?'.      // fragment
'$/i');

/**
 * Verify the syntax of the given URL. 
 * 
 * @access public
 * @param $url The URL to verify.
 * @return boolean
 */
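function is_valid_url($url) {
  if (str_starts_with(strtolower($url), 'http://localhost')) {
    return true;
  }
  // preg_match() returns 1, 0, or false; compare so that a boolean is returned.
  return preg_match(URL_FORMAT, $url) === 1;
}


/**
 * String starts with something
 * 
 * This function will return true only if input string starts with
 * needle
 * 
 * PHP 8 ships a built-in str_starts_with(), so this fallback is only defined
 * on older versions; redeclaring the built-in would be a fatal error.
 * 
 * @param string $string Input string
 * @param string $needle Needle string
 * @return boolean
 */
if (!function_exists('str_starts_with')) {
  function str_starts_with($string, $needle) {
    return substr($string, 0, strlen($needle)) == $needle;
  }
}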


$numtests = 0;
$passed = 0;

/**
 * Test a URL for validity and count results.
 *
 * @param string  $url      URL to test
 * @param boolean $expected expected result (true or false)
 */
function test_url($url, $expected) {
  global $numtests, $passed;
  $numtests++;
  $valid = is_valid_url($url);
  echo "URL Valid?: " . ($valid?"yes":"no") . " for URL: $url. Expected: ".($expected?"yes":"no").". ";
  if($valid == $expected) {
    echo "PASS\n"; $passed++;
  } else {
    echo "FAIL\n";
  }
}

echo "URL Tests:\n\n";

test_url("http://localserver/projects/public/assets/javascript/widgets/UserBoxMenu/widget.css", true);
test_url("http://www.google.com", true);
test_url("http://www.google.co.uk/projects/my%20folder/test.php", true);
test_url("https://myserver.localdomain", true);
test_url("http://192.168.1.120/projects/index.php", true);
test_url("http://192.168.1.1/projects/index.php", true);
test_url("http://projectpier-server.localdomain/projects/public/assets/javascript/widgets/UserBoxMenu/widget.css", true);
test_url("https://2.4.168.19/project-pier?c=test&a=b", true);
test_url("https://localhost/a/b/c/test.php?c=controller&arg1=20&arg2=20", true);
test_url("http://user:password@localhost/a/b/c/test.php?c=controller&arg1=20&arg2=20", true);

echo "\n$passed out of $numtests tests passed.\n\n";

?>

Thanks again to eyelidness for the regex!

meager
  • 209,754
  • 38
  • 307
  • 315
  • 1
    eyelidness' answer didn't work for me, but this one did. Thanks! – Josh Mar 27 '12 at 20:22
  • this one works in JavaScript, but I was *not* able to get the one eyelidness provided to work in JS, even after replacing \x with \u to escape unicode characters – jimmym715 Aug 10 '12 at 19:47
  • 5
    [Sho Kuwamoto](http://stackoverflow.com/users/957663/sho-kuwamoto)'s comment: "I ended up using the regex by user244966, which to me is the perfect blend of readable but thorough. However, there is one MAJOR issue in the regex.... His/her regex fails on domains that contain one character pieces, such as http://t.co The fix is to replace this line `')((([a-z0-9][a-z0-9-]*[a-z0-9]\.)*'.` with `')((([a-z0-9]\.|[a-z0-9][a-z0-9-]*[a-z0-9]\.)*'.`." I've made the relevant edit based on this comment. – Peter O. Oct 24 '12 at 12:15
  • Works beautifully! Anyway I just allowed myself to add support for paths with the tilde character (~), by adding it into the line corresponding to path. – Leo supports Monica Cellio Mar 01 '13 at 21:32
  • `/^(https?|ftp):` (protocol) Why do you disallow protocols like data, file, svn, dc++, magnet, skype or any other supported by a browser having the corresponding plugin or a server? – Aleksey F. Nov 11 '15 at 01:19
47

Mathias Bynens has a great article comparing a lot of regular expressions: In search of the perfect URL validation regex

The best one posted is a little long, but it matches just about anything you can throw at it.

JavaScript version

/^(?:(?:https?|ftp):\/\/)(?:\S+(?::\S*)?@)?(?:(?!(?:10|127)(?:\.\d{1,3}){3})(?!(?:169\.254|192\.168)(?:\.\d{1,3}){2})(?!172\.(?:1[6-9]|2\d|3[0-1])(?:\.\d{1,3}){2})(?:[1-9]\d?|1\d\d|2[01]\d|22[0-3])(?:\.(?:1?\d{1,2}|2[0-4]\d|25[0-5])){2}(?:\.(?:[1-9]\d?|1\d\d|2[0-4]\d|25[0-4]))|(?:(?:[a-z\u00a1-\uffff0-9]-*)*[a-z\u00a1-\uffff0-9]+)(?:\.(?:[a-z\u00a1-\uffff0-9]-*)*[a-z\u00a1-\uffff0-9]+)*(?:\.(?:[a-z\u00a1-\uffff]{2,}))\.?)(?::\d{2,5})?(?:[/?#]\S*)?$/i

PHP version

_^(?:(?:https?|ftp)://)(?:\S+(?::\S*)?@)?(?:(?!(?:10|127)(?:\.\d{1,3}){3})(?!(?:169\.254|192\.168)(?:\.\d{1,3}){2})(?!172\.(?:1[6-9]|2\d|3[0-1])(?:\.\d{1,3}){2})(?:[1-9]\d?|1\d\d|2[01]\d|22[0-3])(?:\.(?:1?\d{1,2}|2[0-4]\d|25[0-5])){2}(?:\.(?:[1-9]\d?|1\d\d|2[0-4]\d|25[0-4]))|(?:(?:[a-z\x{00a1}-\x{ffff}0-9]-*)*[a-z\x{00a1}-\x{ffff}0-9]+)(?:\.(?:[a-z\x{00a1}-\x{ffff}0-9]-*)*[a-z\x{00a1}-\x{ffff}0-9]+)*(?:\.(?:[a-z\x{00a1}-\x{ffff}]{2,}))\.?)(?::\d{2,5})?(?:[/?#]\S*)?$_iuS
nhahtdh
  • 52,949
  • 15
  • 113
  • 149
Kiril
  • 37,748
  • 29
  • 161
  • 218
  • 1
    For preg_match use with PHP use `%^(?:(?:https?|ftp)://)(?:\S+(?::\S*)?@|\d{1,3}(?:\.\d{1,3}){3}|(?:(?:[a-z\d\x{00a1}-\x{ffff}]+-?)*[a-z\d\x{00a1}-\x{ffff}]+)(?:\.(?:[a-z\d\x{00a1}-\x{ffff}]+-?)*[a-z\d\x{00a1}-\x{ffff}]+)*(?:\.[a-z\x{00a1}-\x{ffff}]{2,6}))(?::\d+)?(?:[^\s]*)?$%iu` – Toby Beresford Oct 05 '16 at 13:59
  • On that page, I prefer stephenhay's solution, because it's 38 chars instead of 502! – Venryx Apr 29 '17 at 01:16
  • Also doesn't allow for IP addresses – Matt Fletcher Jan 23 '19 at 15:15
  • give valid (slash slash) : //www.2test.com/ – stackdave Feb 12 '19 at 16:55
  • I tested some JavaScript regular expression URL testers. The above Kril/nhahtdh tester came out the best, with no false negatives and only one false positive, namely http://www.foo.bar./. Interestingly, the Diego Perini original has the same error. Test results posted at http://pagenotes.com/url%20tester.htm – Page Notes Feb 21 '21 at 19:09
35

The post Getting parts of a URL (Regex) discusses parsing a URL to identify its various components. If you want to check if a URL is well-formed, it should be sufficient for your needs.

If you need to check if it's actually valid, you'll eventually have to try to access whatever's on the other end.

In general, though, you'd probably be better off using a function that's supplied to you by your framework or another library. Many platforms include functions that parse URLs. For example, there's Python's urlparse module (urllib.parse in Python 3), and in .NET you could use the System.Uri class's constructor as a means of validating the URL.
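As a rough illustration of the library route, here is a minimal Python sketch built on the standard `urllib.parse.urlparse`. The allow-list of schemes and the helper name `looks_like_url` are assumptions made for this example; the parser only splits the string, so the checks on scheme and host are what do the actual filtering.

```python
from urllib.parse import urlparse

# Assumption for this sketch: only these schemes count as acceptable.
ALLOWED_SCHEMES = {"http", "https", "ftp"}


def looks_like_url(candidate):
    """Return True if candidate parses as an absolute URL with an allowed scheme and a host."""
    if not isinstance(candidate, str):
        return False
    try:
        parts = urlparse(candidate.strip())
        # A scheme outside the allow-list (e.g. javascript:) or a missing host is rejected.
        return parts.scheme in ALLOWED_SCHEMES and bool(parts.hostname)
    except ValueError:
        # urlparse raises ValueError for some malformed inputs, such as a bad IPv6 literal.
        return False


print(looks_like_url("http://www.example.com"))
print(looks_like_url("javascript:alert('xss')"))
```

This checks that the string is well-formed, not that anything answers at the other end.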

Christian Geier
  • 1,849
  • 2
  • 18
  • 26
Blair Conrad
  • 202,794
  • 24
  • 127
  • 110
25

This might not be a job for regexes, but for existing tools in your language of choice. You probably want to use existing code that has already been written, tested, and debugged.

In PHP, use the parse_url function.

Perl: URI module.

Ruby: URI module.

.NET: 'Uri' class

Regexes are not a magic wand you wave at every problem that happens to involve strings.

Andy Lester
  • 81,480
  • 12
  • 93
  • 144
  • 8
    Your last sentence very much reminds me of [Law of the instrument/Maslow's hammer](https://en.wikipedia.org/wiki/Law_of_the_instrument): *"If all you have is a hammer, everything looks like a nail."* – DavidRR Sep 17 '14 at 19:57
  • 4
    Regexes are, however, beautiful for _extracting_ URLs from a body of plaintext. If you suspect the entirety of a string is a URL, then I'd 100% agree with you and mention that Java's equivalent is `java.net.URL`. – ndm13 Apr 17 '17 at 22:59
  • 4
    The docs for parse_url in PHP state: This function is not meant to validate the given URL, it only breaks it up into the above listed parts. – Doug Amos Sep 18 '18 at 07:58
19

This will match all URLs

  • with or without http/https
  • with or without www

...including sub-domains and those new top-level domain name extensions such as .museum, .academy, .foundation etc. which can have up to 63 characters (not just .com, .net, .info etc.)

(([\w]+:)?//)?(([\d\w]|%[a-fA-F\d]{2,2})+(:([\d\w]|%[a-fA-F\d]{2,2})+)?@)?([\d\w][-\d\w]{0,253}[\d\w]\.)+[\w]{2,63}(:[\d]+)?(/([-+_~.\d\w]|%[a-fA-F\d]{2,2})*)*(\?(&?([-+_~.\d\w]|%[a-fA-F\d]{2,2})=?)*)?(#([-+_~.\d\w]|%[a-fA-F\d]{2,2})*)?

At the time of writing, the longest available top-level domain extension is 13 characters (for example .international), so you can change the number 63 in the expression to 13 to prevent someone misusing it.

As JavaScript:

var urlreg=/(([\w]+:)?\/\/)?(([\d\w]|%[a-fA-F\d]{2,2})+(:([\d\w]|%[a-fA-F\d]{2,2})+)?@)?([\d\w][-\d\w]{0,253}[\d\w]\.)+[\w]{2,63}(:[\d]+)?(\/([-+_~.\d\w]|%[a-fA-F\d]{2,2})*)*(\?(&?([-+_~.\d\w]|%[a-fA-F\d]{2,2})=?)*)?(#([-+_~.\d\w]|%[a-fA-F\d]{2,2})*)?/;

$('textarea').on('input',function(){
  var url = $(this).val();
  $(this).toggleClass('invalid', !urlreg.test(url));
});

$('textarea').trigger('input');
textarea{color:green;}
.invalid{color:red;}
<script src="https://ajax.googleapis.com/ajax/libs/jquery/2.1.1/jquery.min.js"></script>
<textarea>http://www.google.com</textarea>
<textarea>http//www.google.com</textarea>
<textarea>googlecom</textarea>
<textarea>https://www.google.com</textarea>

Wikipedia Article: List of all internet top-level domains

AwokeKnowing
  • 6,390
  • 7
  • 32
  • 43
Besnik Kastrati
  • 631
  • 6
  • 5
18

Non-validating URI-reference Parser

For reference purposes, here's the IETF Spec: (TXT | HTML). In particular, Appendix B. Parsing a URI Reference with a Regular Expression demonstrates how to parse a URI reference with a regex. This is described as,

for an example of a non-validating URI-reference parser that will take any given string and extract the URI components.

Here's the regex they provide:

 ^(([^:/?#]+):)?(//([^/?#]*))?([^?#]*)(\?([^#]*))?(#(.*))?
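To make the "extract the components" part concrete, here is a small Python sketch that applies the expression above and reads the components out of capture groups 2, 4, 5, 7 and 9. The function name `split_uri_reference` and the dictionary keys are choices made for this example only.

```python
import re

# The Appendix B expression quoted above, unchanged.
URI_REFERENCE = re.compile(r"^(([^:/?#]+):)?(//([^/?#]*))?([^?#]*)(\?([^#]*))?(#(.*))?")


def split_uri_reference(text):
    """Split a string into URI components without validating it."""
    m = URI_REFERENCE.match(text)
    return {
        "scheme": m.group(2),
        "authority": m.group(4),
        "path": m.group(5),
        "query": m.group(7),
        "fragment": m.group(9),
    }


print(split_uri_reference("http://www.ics.uci.edu/pub/ietf/uri/#Related"))
print(split_uri_reference("#?#?#"))
```

Every part of the expression is optional, so it splits whatever it is given rather than rejecting it, which is why it cannot serve as a validator on its own.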

As someone else said, it's probably best to leave this to a lib/framework you're already using.

user157251
  • 64,489
  • 38
  • 208
  • 350
Hank Gay
  • 65,372
  • 31
  • 148
  • 218
  • 16
    Completely useless. Can someone show me a string which this regex does *not* match? (Both "#?#?#" or "<<<>>>" match. What kind of URIs are those?) – Alex D Apr 13 '13 at 19:39
  • 3
    @AlexD Don't complain to me. That's the official specification for a URI. Take it up with the IETF if you don't like it. – Hank Gay Jul 18 '13 at 14:07
  • 1
    @AlexD I think those might be considered _relative references_. See RFC 3986, section 4.2. – andyg0808 Dec 13 '13 at 10:12
  • 3
    @andyg0808, you may be right, but the fact remains that this regex matches virtually any string under the sun. – Alex D Dec 13 '13 at 18:34
  • 2
    This is not a good answer because it's not validating, as per the question. It's parsing. Those are two different functions. If you give this regex trash, it tries to parse it. If the URL isn't valid, the parsing isn't guaranteed to work. – user157251 Aug 27 '18 at 05:47
  • @Evan Carroll: Anything can be parsed according to some criteria. Feed any regex on this page with a string, and where it doesn't parse to a valid URL, it's an invalid URL by assertion. Then trialing the result validates the regex assertion. You're right, the answer says _Non-validating URI-reference Parser_ "for reference purposes", which might be included in an answer to something like [this thread](https://stackoverflow.com/questions/3487089/are-regular-expressions-used-to-build-parsers), and then cross-linked. – Laurie Stearn Apr 10 '19 at 10:04
  • @AlexD According to Python's `urllib.parse.urlparse()`, an entirely valid URI: `ParseResult(scheme='', netloc='', path='', params='', query='', fragment='?#?#')`. Just because it's useless doesn't mean it's invalid. – Wayne Werner Mar 19 '20 at 21:54
12

For me, the best regular expression for URLs is:

"(([\w]+:)?//)?(([\d\w]|%[a-fA-F\d]{2,2})+(:([\d\w]|%[a-fA-F\d]{2,2})+)?@)?([\d\w][-\d\w]{0,253}[\d\w]\.)+[\w]{2,4}(:[\d]+)?(/([-+_~.\d\w]|%[a-fA-F\d]{2,2})*)*(\?(&?([-+_~.\d\w]|%[a-fA-F\d]{2,2})=?)*)?(#([-+_~.\d\w]|%[a-fA-F\d]{2,2})*)?"
ndm13
  • 1,087
  • 14
  • 19
S.p
  • 999
  • 3
  • 13
  • 27
  • this seems to be limited w/r/t number of domains it'll accept? – rektide Feb 02 '14 at 22:25
  • 2
    Thanks! Here's the escaped version that worked for me on iOS: `(([\\w]+:)?//)?(([\\d\\w]|%[a-fA-f\\d]{2,2})+(:([\\d\\w]|%[a-fA-f\\d]{2,2})+)?@)?([\\d\\w][-\\d\\w]{0,253}[\\d\\w]\\.)+[\\w]{2,4}(:[\\d]+)?(/([-+_~.\\d\\w]|%[a-fA-f\\d]{2,2})*)*(\\?(&?([-+_~.\\d\\w]|%[a-fA-f\\d]{2,2})=?)*)?(#([-+_~.\\d\\w]|%[a-fA-f\\d]{2,2})*)?` – James Kuang Feb 03 '14 at 23:19
  • This regex only matches suffixes up to 4 characters long and fails on IP addresses (v4 and v6), localhost, and domain names with foreign characters. I would recommend editing your inclusion size ranges and replacing `\w` with `\p{L}` at a minimum. – ndm13 May 05 '17 at 20:25
  • Note that this RegEx doesn't capture URLs that have subdomains of one letter only, like **"http://m.sitename.com"**. In order to fix that, I had to change `([\d\w][-\d\w]{0,253}[\d\w]\.)+` into `([\d\w][-\d\w]{0,253}[\d\w]?\.)+` (add a question mark near the end of it) – Yoav Feuerstein Aug 31 '17 at 03:59
9
function validateURL(textval) {
    // A regex literal keeps the backslash escapes that a string passed to
    // new RegExp("...") would silently drop (for example "\." becoming ".").
    var urlregex = /^(http|https|ftp):\/\/([a-zA-Z0-9.\-]+(:[a-zA-Z0-9.&%$\-]+)*@)*((25[0-5]|2[0-4][0-9]|[0-1]{1}[0-9]{2}|[1-9]{1}[0-9]{1}|[1-9])\.(25[0-5]|2[0-4][0-9]|[0-1]{1}[0-9]{2}|[1-9]{1}[0-9]{1}|[1-9]|0)\.(25[0-5]|2[0-4][0-9]|[0-1]{1}[0-9]{2}|[1-9]{1}[0-9]{1}|[1-9]|0)\.(25[0-5]|2[0-4][0-9]|[0-1]{1}[0-9]{2}|[1-9]{1}[0-9]{1}|[0-9])|localhost|([a-zA-Z0-9\-]+\.)*[a-zA-Z0-9\-]+\.(com|edu|gov|int|mil|net|org|biz|arpa|info|name|pro|aero|coop|museum|[a-zA-Z]{2}))(:[0-9]+)*(\/($|[a-zA-Z0-9.,?'\\\+&%$#=~_\-]+))*$/;
    return urlregex.test(textval);
}

Matches http://site.com/dir/file.php?var=moo | ftp://user:pass@site.com:21/file/dir

Non-Matches site.com | http://site.com/dir//

LifeInstructor
  • 1,498
  • 1
  • 18
  • 24
8

I was not able to find the regex I was looking for, so I modified a regex to fulfill my requirements, and it seems to work fine now. My requirements were:

Here is what I came up with; any suggestions are appreciated:

@Test
    public void testWebsiteUrl(){
        String regularExpression = "((http|ftp|https):\\/\\/)?[\\w\\-_]+(\\.[\\w\\-_]+)+([\\w\\-\\.,@?^=%&:/~\\+#]*[\\w\\-\\@?^=%&/~\\+#])?";

        assertTrue("www.google.com".matches(regularExpression));
        assertTrue("www.google.co.uk".matches(regularExpression));
        assertTrue("http://www.google.com".matches(regularExpression));
        assertTrue("http://www.google.co.uk".matches(regularExpression));
        assertTrue("https://www.google.com".matches(regularExpression));
        assertTrue("https://www.google.co.uk".matches(regularExpression));
        assertTrue("google.com".matches(regularExpression));
        assertTrue("google.co.uk".matches(regularExpression));
        assertTrue("google.mu".matches(regularExpression));
        assertTrue("mes.intnet.mu".matches(regularExpression));
        assertTrue("cse.uom.ac.mu".matches(regularExpression));

        assertTrue("http://www.google.com/path".matches(regularExpression));
        assertTrue("http://subdomain.web-site.com/cgi-bin/perl.cgi?key1=value1&key2=value2e".matches(regularExpression));
        assertTrue("http://www.google.com/?queryparam=123".matches(regularExpression));
        assertTrue("http://www.google.com/path?queryparam=123".matches(regularExpression));

        assertFalse("www..dr.google".matches(regularExpression));

        assertFalse("www:google.com".matches(regularExpression));

        assertFalse("https://www@.google.com".matches(regularExpression));

        assertFalse("https://www.google.com\"".matches(regularExpression));
        assertFalse("https://www.google.com'".matches(regularExpression));

        assertFalse("http://www.google.com/path'".matches(regularExpression));
        assertFalse("http://subdomain.web-site.com/cgi-bin/perl.cgi?key1=value1&key2=value2e'".matches(regularExpression));
        assertFalse("http://www.google.com/?queryparam=123'".matches(regularExpression));
        assertFalse("http://www.google.com/path?queryparam=12'3".matches(regularExpression));

    }
thermz
  • 2,237
  • 2
  • 18
  • 27
  • +1, love when people add test cases; it is so easy to eyeball rather than trying to decipher the regex on the fly. – Dawid O Oct 11 '19 at 15:07
7
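function validateURL(textval) {
    // A regex literal keeps the backslash escapes that a string passed to
    // new RegExp("...") would silently drop (for example "\." becoming ".").
    var urlregex = /^(http|https|ftp):\/\/[a-zA-Z0-9\-.]+\.[a-zA-Z]{2,3}(:[a-zA-Z0-9]*)?\/?([a-zA-Z0-9\-._?,'\/\\\+&%$#=~])*$/;
    return urlregex.test(textval);
}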

Matches http://www.asdah.com/~joe | ftp://ftp.asdah.co.uk:2828/asdah%20asdah.gif | https://asdah.gov/asdh-ah.as

LifeInstructor
  • 1,498
  • 1
  • 18
  • 24
7

If you are really searching for the ultimate match, you will probably find it in "A Good Url Regular Expression?".

But a regex that really matches all possible domains and allows anything that is allowed according to RFCs is horribly long and unreadable, trust me ;-)

the Tin Man
  • 150,910
  • 39
  • 198
  • 279
Mecki
  • 106,869
  • 31
  • 201
  • 225
7

Here is a good rule that covers the common cases: ports, parameters, and so on.

/(https?:\/\/(?:[a-z0-9](?:[a-z0-9-]{0,61}[a-z0-9])?\.)+[a-z0-9][a-z0-9-]{0,61}[a-z0-9])(:?\d*)\/?([a-z_\/0-9\-#.]*)\??([a-z_\/0-9\-#=&]*)/g
Dmytro Huz
  • 958
  • 8
  • 18
6

I wrote a little Groovy version that you can run.

It matches the URLs listed after the code (which is good enough for me):

public static void main(args) {
    String url = "go to http://www.m.abut.ly/abc its awesome"
    url = url.replaceAll(/https?:\/\/w{0,3}\w*?\.(\w*?\.)?\w{2,3}\S*|www\.(\w*?\.)?\w*?\.\w{2,3}\S*|(\w*?\.)?\w*?\.\w{2,3}[\/\?]\S*/ , { it ->
        "woof${it}woof"
    })
    println url 
}
http://google.com
http://google.com/help.php
http://google.com/help.php?a=5

http://www.google.com
http://www.google.com/help.php
http://www.google.com?a=5

google.com?a=5
google.com/help.php
google.com/help.php?a=5

http://www.m.google.com/help.php?a=5 (and all its permutations)
www.m.google.com/help.php?a=5 (and all its permutations)
m.google.com/help.php?a=5 (and all its permutations)

The important thing for any URLs that don't start with http or www is that they must include a / or ?

I bet this can be tweaked a little more but it does the job pretty nice for being so short and compact... because you can pretty much split it in 3:

find anything that starts with http:

https?:\/\/w{0,3}\w*?\.\w{2,3}\S*

find anything that starts with www:

www\.\w*?\.\w{2,3}\S*

or find anything that must have a text then a dot then at least 2 letters and then a ? or /:

\w*?\.\w{2,3}[\/\?]\S*
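For anyone who wants to try the three-way split outside Groovy, here is a rough Python equivalent of the snippet above. It joins the same three alternatives from the full expression with `|` and wraps each match in "woof", as the Groovy version does. The names `URL_IN_TEXT` and `mark_urls` are invented for this sketch.

```python
import re

# The three alternatives from the Groovy expression, in the same order:
# 1. starts with http:// or https://
# 2. starts with www.
# 3. bare host that must be followed by / or ?
URL_IN_TEXT = re.compile(
    r"https?://w{0,3}\w*?\.(\w*?\.)?\w{2,3}\S*"
    r"|www\.(\w*?\.)?\w*?\.\w{2,3}\S*"
    r"|(\w*?\.)?\w*?\.\w{2,3}[/?]\S*"
)


def mark_urls(text):
    """Wrap every URL-like match in 'woof' markers, like the Groovy example."""
    return URL_IN_TEXT.sub(lambda m: "woof" + m.group(0) + "woof", text)


print(mark_urls("go to http://www.m.abut.ly/abc its awesome"))
# prints: go to woofhttp://www.m.abut.ly/abcwoof its awesome
```

Python's `re` needs no escaping for `/`, so the `\/` sequences from the Groovy slashy string are written as plain `/` here.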
Dane Brouwer
  • 1,974
  • 1
  • 17
  • 21
6
^(http:\/\/www\.|https:\/\/www\.|http:\/\/|https:\/\/)?[a-z0-9]+([\-\.]{1}[a-z0-9]+)*\.[a-z]{2,5}(:[0-9]{1,5})?(\/.*)?$

live demo: https://regex101.com/r/HUNasA/2

I have tested various expressions to match my requirements.

As a user I can hit browser search bar with following strings:

valid urls

invalid urls

Nodarii
  • 774
  • 5
  • 20
5

I use this regex:

((https?:)?//)?(([\d\w]|%[a-fA-F\d]{2,2})+(:([\d\w]|%[a-fA-F\d]{2,2})+)?@)?([\d\w][-\d\w]{0,253}[\d\w]\.)+[\w]{2,63}(:[\d]+)?(/([-+_~.\d\w]|%[a-fA-F\d]{2,2})*)*(\?(&?([-+_~.\d\w]|%[a-fA-F\d]{2,2})=?)*)?(#([-+_~.\d\w]|%[a-fA-F\d]{2,2})*)?

To support both:

http://stackoverflow.com
https://stackoverflow.com

And:

//stackoverflow.com
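If a parser is an option instead of a regex, the same two cases can be told apart with Python's standard `urllib.parse.urlparse`, which reports an empty scheme but a non-empty host for the `//host` form. This is only a sketch of that idea; the function name `classify` and its three labels are made up for the example.

```python
from urllib.parse import urlparse


def classify(url):
    """Label a string as 'absolute', 'protocol-relative' or 'other'."""
    parts = urlparse(url)
    if parts.scheme in ("http", "https") and parts.netloc:
        return "absolute"
    if not parts.scheme and parts.netloc:
        # "//stackoverflow.com" has no scheme but does have a host.
        return "protocol-relative"
    return "other"


print(classify("https://stackoverflow.com"))
print(classify("//stackoverflow.com"))
print(classify("stackoverflow.com"))
```

A bare "stackoverflow.com" lands in "other" because, without the leading slashes, the parser treats it as a path rather than a host.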
Mikael Engver
  • 4,078
  • 4
  • 40
  • 51
  • 2
    I had to update your regex. The third '?' was allowing all sorts of text to be selected. After removing it only 'http', 'https', or '//' were selected. I modified this so it works on relative URLs to '/'. And escaped the forward slashes. `((https?:)?(\/?\/))(([\d\w]|%[a-fA-f\d]{2,2})+(:([\d\w]|%[a-fA-f\d]{2,2})+)?@)?([\d\w][-\d\w]{0,253}[\d\w]\.)+[\w]{2,63}(:[\d]+)?(/([-+_~.\d\w]|%[a-fA-f\d]{2,2})*)*(\?(&?([-+_~.\d\w]|%[a-fA-f\d]{2,2})=?)*)?(#([-+_~.\d\w]|%[a-fA-f\d]{2,2})*)?` – Markus Aug 28 '14 at 12:52
  • 1
    Updated the capturing groups so they can be more useful: `((?:https?:)?(?:\/?\/))((?:[\d\w]|%[a-fA-f\d]{2,2})+(?::(?:[\d\w]|%[a-fA-f\d]{2,2})+)?@)?((?:[\d\w][-\d\w]{0,253}[\d\w]\.)+[\w]{2,63})(:[\d]+)?(\/(?:[-+_~.\d\w]|%[a-fA-f\d]{2,2})*)*(\?(?:&?(?:[-+_~.\d\w]|%[a-fA-f\d]{2,2})=?)*)?(#(?:[-+_~.\d\w]|%[a-fA-f\d]{2,2})*)?` – panec Jan 05 '18 at 17:08
5

I've been working on an in-depth article discussing URI validation using regular expressions. It is based on RFC3986.

Regular Expression URI Validation

Although the article is not yet complete, I have come up with a PHP function which does a pretty good job of validating HTTP and FTP URLs. Here is the current version:

// function url_valid($url) { Rev:20110423_2000
//
// Return associative array of valid URI components, or FALSE if $url is not
// RFC-3986 compliant. If the passed URL begins with: "www." or "ftp.", then
// "http://" or "ftp://" is prepended and the corrected full-url is stored in
// the return array with a key name "url". This value should be used by the caller.
//
// Return value: FALSE if $url is not valid, otherwise array of URI components:
// e.g.
// Given: "http://www.jmrware.com:80/articles?height=10&width=75#fragone"
// Array(
//    [scheme] => http
//    [authority] => www.jmrware.com:80
//    [userinfo] =>
//    [host] => www.jmrware.com
//    [IP_literal] =>
//    [IPV6address] =>
//    [ls32] =>
//    [IPvFuture] =>
//    [IPv4address] =>
//    [regname] => www.jmrware.com
//    [port] => 80
//    [path_abempty] => /articles
//    [query] => height=10&width=75
//    [fragment] => fragone
//    [url] => http://www.jmrware.com:80/articles?height=10&width=75#fragone
// )
function url_valid($url) {
    if (strpos($url, 'www.') === 0) $url = 'http://'. $url;
    if (strpos($url, 'ftp.') === 0) $url = 'ftp://'. $url;
    if (!preg_match('/# Valid absolute URI having a non-empty, valid DNS host.
        ^
        (?P<scheme>[A-Za-z][A-Za-z0-9+\-.]*):\/\/
        (?P<authority>
          (?:(?P<userinfo>(?:[A-Za-z0-9\-._~!$&\'()*+,;=:]|%[0-9A-Fa-f]{2})*)@)?
          (?P<host>
            (?P<IP_literal>
              \[
              (?:
                (?P<IPV6address>
                  (?:                                                (?:[0-9A-Fa-f]{1,4}:){6}
                  |                                                ::(?:[0-9A-Fa-f]{1,4}:){5}
                  | (?:                          [0-9A-Fa-f]{1,4})?::(?:[0-9A-Fa-f]{1,4}:){4}
                  | (?:(?:[0-9A-Fa-f]{1,4}:){0,1}[0-9A-Fa-f]{1,4})?::(?:[0-9A-Fa-f]{1,4}:){3}
                  | (?:(?:[0-9A-Fa-f]{1,4}:){0,2}[0-9A-Fa-f]{1,4})?::(?:[0-9A-Fa-f]{1,4}:){2}
                  | (?:(?:[0-9A-Fa-f]{1,4}:){0,3}[0-9A-Fa-f]{1,4})?::   [0-9A-Fa-f]{1,4}:
                  | (?:(?:[0-9A-Fa-f]{1,4}:){0,4}[0-9A-Fa-f]{1,4})?::
                  )
                  (?P<ls32>[0-9A-Fa-f]{1,4}:[0-9A-Fa-f]{1,4}
                  | (?:(?:25[0-5]|2[0-4][0-9]|[01]?[0-9][0-9]?)\.){3}
                       (?:25[0-5]|2[0-4][0-9]|[01]?[0-9][0-9]?)
                  )
                |   (?:(?:[0-9A-Fa-f]{1,4}:){0,5}[0-9A-Fa-f]{1,4})?::   [0-9A-Fa-f]{1,4}
                |   (?:(?:[0-9A-Fa-f]{1,4}:){0,6}[0-9A-Fa-f]{1,4})?::
                )
              | (?P<IPvFuture>[Vv][0-9A-Fa-f]+\.[A-Za-z0-9\-._~!$&\'()*+,;=:]+)
              )
              \]
            )
          | (?P<IPv4address>(?:(?:25[0-5]|2[0-4][0-9]|[01]?[0-9][0-9]?)\.){3}
                               (?:25[0-5]|2[0-4][0-9]|[01]?[0-9][0-9]?))
          | (?P<regname>(?:[A-Za-z0-9\-._~!$&\'()*+,;=]|%[0-9A-Fa-f]{2})+)
          )
          (?::(?P<port>[0-9]*))?
        )
        (?P<path_abempty>(?:\/(?:[A-Za-z0-9\-._~!$&\'()*+,;=:@]|%[0-9A-Fa-f]{2})*)*)
        (?:\?(?P<query>       (?:[A-Za-z0-9\-._~!$&\'()*+,;=:@\\/?]|%[0-9A-Fa-f]{2})*))?
        (?:\#(?P<fragment>    (?:[A-Za-z0-9\-._~!$&\'()*+,;=:@\\/?]|%[0-9A-Fa-f]{2})*))?
        $
        /mx', $url, $m)) return FALSE;
    switch ($m['scheme']) {
    case 'https':
    case 'http':
        if ($m['userinfo']) return FALSE; // HTTP scheme does not allow userinfo.
        break;
    case 'ftps':
    case 'ftp':
        break;
    default:
        return FALSE;   // Unrecognized URI scheme. Default to FALSE.
    }
    // Validate host name conforms to DNS "dot-separated-parts".
    if ($m['regname']) { // If host regname specified, check for DNS conformance.
        if (!preg_match('/# HTTP DNS host name.
            ^                      # Anchor to beginning of string.
            (?!.{256})             # Overall host length is less than 256 chars.
            (?:                    # Group dot separated host part alternatives.
              [A-Za-z0-9]\.        # Either a single alphanum followed by dot
            |                      # or... part has more than one char (63 chars max).
              [A-Za-z0-9]          # Part first char is alphanum (no dash).
              [A-Za-z0-9\-]{0,61}  # Internal chars are alphanum plus dash.
              [A-Za-z0-9]          # Part last char is alphanum (no dash).
              \.                   # Each part followed by literal dot.
            )*                     # Zero or more parts before top level domain.
            (?:                    # Explicitly specify top level domains.
              com|edu|gov|int|mil|net|org|biz|
              info|name|pro|aero|coop|museum|
              asia|cat|jobs|mobi|tel|travel|
              [A-Za-z]{2})         # Country codes are exactly two alpha chars.
              \.?                  # Top level domain can end in a dot.
            $                      # Anchor to end of string.
            /ix', $m['host'])) return FALSE;
    }
    $m['url'] = $url;
    for ($i = 0; isset($m[$i]); ++$i) unset($m[$i]);
    return $m; // return TRUE == array of useful named $matches plus the valid $url.
}

This function utilizes two regexes; one to match a subset of valid generic URIs (absolute ones having a non-empty host), and a second to validate the DNS "dot-separated-parts" host name. Although this function currently validates only HTTP and FTP schemes, it is structured such that it can be easily extended to handle other schemes.
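The second stage described above, checking that the host is a well-formed DNS name, can be sketched on its own in Python. This is only a rough illustration and not a port of the PHP function: the names `LABEL` and `is_dns_hostname` are made up for this example, and unlike the PHP regex it does not check the top-level domain against a fixed list.

```python
import re

# One DNS label: 1 to 63 characters, alphanumeric at both ends,
# dashes allowed only in the interior.
LABEL = re.compile(r"[A-Za-z0-9](?:[A-Za-z0-9-]{0,61}[A-Za-z0-9])?")


def is_dns_hostname(host: str) -> bool:
    """Return True if host looks like a dot-separated DNS name."""
    if host.endswith("."):
        host = host[:-1]  # a single trailing dot is allowed
    if not host or len(host) > 255:
        return False
    # Every dot-separated part must be a valid label on its own.
    return all(LABEL.fullmatch(label) for label in host.split("."))


print(is_dns_hostname("www.jmrware.com"))    # True
print(is_dns_hostname("-bad-.example.com"))  # False
```

Splitting on dots and checking each label separately keeps the per-label pattern short, at the cost of doing the length check outside the regex.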

ridgerunner
  • 30,685
  • 4
  • 51
  • 68
  • I'm curious why you chose to follow URI RFC3986 rather than IRI RFC3987. – eyelidlessness Nov 09 '12 at 18:21
  • @eyelidlessness - Good question. I'm not really well versed with IRIs. Thanks for pointing out that RFC. I see that according to RFC3987: _"...in the HTTP protocol [RFC2616], the Request URI is defined as a URI, which means that direct use of IRIs is not allowed in HTTP requests."_ So an IRI is actually encoded as a URI before being sent via HTTP. So for the time being, there will always be a need for URI validation. Maybe I'll tackle IRI validation at a later date. Thanks for the comment! – ridgerunner Nov 09 '12 at 23:57
  • @ridgerunner, the reference to 2616 is outdated. IRIs are sent as IRIs, with all of the characters that IRIs allow and URIs don't. I appreciate the effort to create a "human readable" pattern (and I've worked on one myself but haven't had the opportunity to test sufficiently) but in 2012 and going into 2013 it's unacceptable to limit addresses to western characters while non-western characters are in fact in wide use in paths, fragments and even domains. – eyelidlessness Nov 10 '12 at 08:42
  • @eyelidlessness - I guess I need to take a closer look into this. Thanks for the heads up. – ridgerunner Nov 10 '12 at 15:52
  • @ridgerunner, cheers! And I apologize if I came off as rude, I shouldn't comment after drinking! I do applaud the effort to make a human-readable pattern, and you have my upvote. – eyelidlessness Nov 10 '12 at 17:17
4

For Python, this is the actual URL validating regex used in Django 1.5.1:

import re
regex = re.compile(
        r'^(?:http|ftp)s?://'  # http://, https://, ftp:// or ftps://
        r'(?:(?:[A-Z0-9](?:[A-Z0-9-]{0,61}[A-Z0-9])?\.)+(?:[A-Z]{2,6}\.?|[A-Z0-9-]{2,}\.?)|'  # domain...
        r'localhost|'  # localhost...
        r'\d{1,3}\.\d{1,3}\.\d{1,3}\.\d{1,3}|'  # ...or ipv4
        r'\[?[A-F0-9]*:[A-F0-9:]+\]?)'  # ...or ipv6
        r'(?::\d+)?'  # optional port
        r'(?:/?|[/?]\S+)$', re.IGNORECASE)

This handles both IPv4 and IPv6 addresses as well as ports and GET parameters.

Found in the code here, Line 44.
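For anyone who wants to try the pattern quickly, here is a small self-contained sketch that wraps it in a helper. The regex is copied from the snippet above; the names `URL_RE` and `is_valid_url` are only placeholders for this example and are not part of Django.

```python
import re

URL_RE = re.compile(
    r'^(?:http|ftp)s?://'  # scheme
    r'(?:(?:[A-Z0-9](?:[A-Z0-9-]{0,61}[A-Z0-9])?\.)+(?:[A-Z]{2,6}\.?|[A-Z0-9-]{2,}\.?)|'  # domain...
    r'localhost|'  # localhost...
    r'\d{1,3}\.\d{1,3}\.\d{1,3}\.\d{1,3}|'  # ...or ipv4
    r'\[?[A-F0-9]*:[A-F0-9:]+\]?)'  # ...or ipv6
    r'(?::\d+)?'  # optional port
    r'(?:/?|[/?]\S+)$', re.IGNORECASE)


def is_valid_url(candidate: str) -> bool:
    """Return True if the whole string matches the pattern."""
    return URL_RE.match(candidate) is not None


for sample in ("http://example.com",
               "http://localhost:8000/",
               "http://192.168.0.1/path?x=1",
               "example.com"):
    print(sample, is_valid_url(sample))
```

Note that the scheme is mandatory here, so a bare `example.com` is rejected.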

Ewan
  • 12,912
  • 3
  • 45
  • 55
4

This one works for me very well. (https?|ftp)://(www\d?|[a-zA-Z0-9]+)?\.[a-zA-Z0-9-]+(\:|\.)([a-zA-Z0-9.]+|(\d+)?)([/?:].*)?

samayo
  • 13,907
  • 11
  • 78
  • 98
Shantonu
  • 1,060
  • 11
  • 10
4

Here's a ready-to-go Java version from the Android source code. This is the best one I've found.

public static final Matcher WEB  = Pattern.compile(new StringBuilder()                 
.append("((?:(http|https|Http|Https|rtsp|Rtsp):")                      
.append("\\/\\/(?:(?:[a-zA-Z0-9\\$\\-\\_\\.\\+\\!\\*\\'\\(\\)")                         
.append("\\,\\;\\?\\&\\=]|(?:\\%[a-fA-F0-9]{2})){1,64}(?:\\:(?:[a-zA-Z0-9\\$\\-\\_")                         
.append("\\.\\+\\!\\*\\'\\(\\)\\,\\;\\?\\&\\=]|(?:\\%[a-fA-F0-9]{2})){1,25})?\\@)?)?")                         
.append("((?:(?:[a-zA-Z0-9][a-zA-Z0-9\\-]{0,64}\\.)+")   // named host                            
.append("(?:")   // plus top level domain                         
.append("(?:aero|arpa|asia|a[cdefgilmnoqrstuwxz])")                         
.append("|(?:biz|b[abdefghijmnorstvwyz])")                         
.append("|(?:cat|com|coop|c[acdfghiklmnoruvxyz])")                         
.append("|d[ejkmoz]")                         
.append("|(?:edu|e[cegrstu])")                         
.append("|f[ijkmor]")                         
.append("|(?:gov|g[abdefghilmnpqrstuwy])")                         
.append("|h[kmnrtu]")                         
.append("|(?:info|int|i[delmnoqrst])")                         
.append("|(?:jobs|j[emop])")                         
.append("|k[eghimnrwyz]")                         
.append("|l[abcikrstuvy]")                         
.append("|(?:mil|mobi|museum|m[acdghklmnopqrstuvwxyz])")                         
.append("|(?:name|net|n[acefgilopruz])")                         
.append("|(?:org|om)")                         
.append("|(?:pro|p[aefghklmnrstwy])")                         
.append("|qa")                         
.append("|r[eouw]")                         
.append("|s[abcdeghijklmnortuvyz]")                         
.append("|(?:tel|travel|t[cdfghjklmnoprtvwz])")                         
.append("|u[agkmsyz]")                         
.append("|v[aceginu]")                         
.append("|w[fs]")                         
.append("|y[etu]")                         
.append("|z[amw]))")                         
.append("|(?:(?:25[0-5]|2[0-4]") // or ip address                                                 
.append("[0-9]|[0-1][0-9]{2}|[1-9][0-9]|[1-9])\\.(?:25[0-5]|2[0-4][0-9]")                             
.append("|[0-1][0-9]{2}|[1-9][0-9]|[1-9]|0)\\.(?:25[0-5]|2[0-4][0-9]|[0-1]")                         
.append("[0-9]{2}|[1-9][0-9]|[1-9]|0)\\.(?:25[0-5]|2[0-4][0-9]|[0-1][0-9]{2}")                         
.append("|[1-9][0-9]|[0-9])))")                         
.append("(?:\\:\\d{1,5})?)") // plus option port number                             
.append("(\\/(?:(?:[a-zA-Z0-9\\;\\/\\?\\:\\@\\&\\=\\#\\~")  // plus option query params                         
.append("\\-\\.\\+\\!\\*\\'\\(\\)\\,\\_])|(?:\\%[a-fA-F0-9]{2}))*)?")                         
.append("(?:\\b|$)").toString()                 
).matcher("");
kash
  • 189
  • 1
  • 7
  • This don't work with "New gTLDs", check http://en.wikipedia.org/wiki/List_of_Internet_top-level_domains & http://newgtlds.icann.org/en/program-status/delegated-strings. Hardcoding list of TLD is bad practice... Some public suffix lists are available, they include recent variant of TLD: https://publicsuffix.org/ (used in Firefox, Chrome, IE) – osgx Feb 28 '17 at 02:17
  • My first thought at seeing this: there's no kill like overkill. They literally took all ccTLDs and built a regex to match them specifically. Cuts down on false positives, I suppose, but a terrible way to handle the situation. – ndm13 Apr 17 '17 at 23:18
4

Here is a regex I made which extracts the different parts of a URL:

^((?:https?|ftp):\/\/?)?([^:/\s.]+\.[^:/\s]|localhost)(:\d+)?((?:\/\w+)*\/)?([\w\-.]+[^#?\s]+)?([^#]+)?(#[\w-]+)?$

((?:https?|ftp):\/\/?)?(group 1): extracts the protocol
([^:/\s.]+\.[^:/\s]|localhost)(group 2): extracts the hostname
(:\d+)?(group 3): extracts the port number
((?:\/\w+)*\/)?([\w\-.]+[^#?\s]+)?(groups 4 & 5): extracts the path part
([^#]+)?(group 6): extracts the query part
(#[\w-]+)?(group 7): extracts the hash part

For every part of the regex listed above, you can remove the trailing ? to make it mandatory (or add one to make it optional). You can also remove the ^ at the beginning and the $ at the end of the regex so that it doesn't need to match the whole string.

See it on regex101.

Note: this regex is not 100% safe and may accept some strings which are not necessarily valid URLs, but it does check some criteria. Its main goal is to extract the different parts of a URL, not to validate it.
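As a rough illustration of the group layout listed above, the same pattern can be run through Python's `re` module. The names `URL_PARTS` and `split_url` are invented for this sketch; the pattern itself is unchanged.

```python
import re

URL_PARTS = re.compile(
    r"^((?:https?|ftp):\/\/?)?"            # group 1: protocol
    r"([^:/\s.]+\.[^:/\s]|localhost)"      # group 2: hostname
    r"(:\d+)?"                             # group 3: port
    r"((?:\/\w+)*\/)?([\w\-.]+[^#?\s]+)?"  # groups 4 and 5: path
    r"([^#]+)?"                            # group 6: query
    r"(#[\w-]+)?$"                         # group 7: hash
)


def split_url(url: str):
    """Return the seven captured groups as a tuple, or None if no match."""
    m = URL_PARTS.match(url)
    return m.groups() if m else None


parts = split_url("http://localhost:8080/a/b/file.html?x=1#top")
print(parts)
```

Groups that did not take part in the match come back as `None`, so callers should check each part before using it.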

Elie G.
  • 1,050
  • 17
  • 32
  • Thanks. The group approach to these answers is best. Here's hoping for updates following the direction of [this article](http://jmrware.com/articles/2009/uri_regexp/URI_regex.html) linked on the next page, and a revision of the "not 100% safe". A quantification like 99.9% is enough for most readers. :P – Laurie Stearn Apr 10 '19 at 09:22
3

For convenience, here's a one-liner regexp for URLs that will also match localhost, where you're more likely to have ports than .com or similar.

(http(s)?:\/\/.)?(www\.)?[-a-zA-Z0-9@:%._\+~#=]{2,256}(\.[a-z]{2,6}|:[0-9]{3,4})\b([-a-zA-Z0-9@:%_\+.~#?&\/\/=]*)
miphe
  • 1,306
  • 13
  • 28
3

I found the following Regex for URLs, tested successfully with 500+ URLs:

/\b(?:(?:https?|ftp):\/\/)(?:\S+(?::\S*)?@)?(?:(?!10(?:\.\d{1,3}){3})(?!127(?:\.\d{1,3}){3})(?!169\.254(?:\.\d{1,3}){2})(?!192\.168(?:\.\d{1,3}){2})(?!172\.(?:1[6-9]|2\d|3[0-1])(?:\.\d{1,3}){2})(?:[1-9]\d?|1\d\d|2[01]\d|22[0-3])(?:\.(?:1?\d{1,2}|2[0-4]\d|25[0-5])){2}(?:\.(?:[1-9]\d?|1\d\d|2[0-4]\d|25[0-4]))|(?:(?:[a-z\x{00a1}-\x{ffff}0-9]+-?)*[a-z\x{00a1}-\x{ffff}0-9]+)(?:\.(?:[a-z\x{00a1}-\x{ffff}0-9]+-?)*[a-z\x{00a1}-\x{ffff}0-9]+)*(?:\.(?:[a-z\x{00a1}-\x{ffff}]{2,})))(?::\d{2,5})?(?:\/[^\s]*)?\b/gi

I know it looks ugly, but the good thing is that it works. :)

Explanation and demo with 581 random URLs on regex101.

Source: In search of the perfect URL validation regex

Rahul Desai
  • 13,802
  • 14
  • 75
  • 128
  • 3
    Your regex is doing the work in 155'000 steps. Here is another regex that is evaluating all the 580 URLS your provided in 19'000 steps [regex101 link](https://regex101.com/r/hU9aV3/7): `/(https?):\/\/([\w-]+(\.[\\w-]+)*\.([a-z]+))(([\w.,@?^=%&:\/~+#()!-]*)([\w@?^=%&\/~+#()!-]))?/gi` – Jonathan Maim Nov 10 '15 at 04:42
3

There are various options for matching a URL, and the right one depends on your requirements. Below are a few.

_(^|[\s.:;?\-\]<\(])(https?://[-\w;/?:@&=+$\|\_.!~*\|'()\[\]%#,☺]+[\w/#](\(\))?)(?=$|[\s',\|\(\).:;?\-\[\]>\)])_i

#\b(([\w-]+://?|www[.])[^\s()<>]+(?:\([\w\d]+\)|([^[:punct:]\s]|/)))#iS

And here is a page that compares more than 10 different URL validation regexes:

https://mathiasbynens.be/demo/url-regex

maxspan
  • 10,656
  • 12
  • 61
  • 88
3

I hope it's helpful for you...

^(http|https):\/\/+[\www\d]+\.[\w]+(\/[\w\d]+)?
Ravi Matani
  • 794
  • 1
  • 7
  • 21
2

I tried to formulate my own version of a URL regex. My requirement was to capture instances in a String where a possible URL could be cse.uom.ac.mu, noting that it is preceded by neither http nor www.

String regularExpression = "((((ht{2}ps?://)?)((w{3}\\.)?))?)[^.&&[a-zA-Z0-9]][a-zA-Z0-9.-]+[^.&&[a-zA-Z0-9]](\\.[a-zA-Z]{2,3})";

assertTrue("www.google.com".matches(regularExpression));
assertTrue("www.google.co.uk".matches(regularExpression));
assertTrue("http://www.google.com".matches(regularExpression));
assertTrue("http://www.google.co.uk".matches(regularExpression));
assertTrue("https://www.google.com".matches(regularExpression));
assertTrue("https://www.google.co.uk".matches(regularExpression));
assertTrue("google.com".matches(regularExpression));
assertTrue("google.co.uk".matches(regularExpression));
assertTrue("google.mu".matches(regularExpression));
assertTrue("mes.intnet.mu".matches(regularExpression));
assertTrue("cse.uom.ac.mu".matches(regularExpression));

//cannot contain 2 '.' after www
assertFalse("www..dr.google".matches(regularExpression));

//cannot contain 2 '.' just before com
assertFalse("www.dr.google..com".matches(regularExpression));

// to test case where url www must be followed with a '.'
assertFalse("www:google.com".matches(regularExpression));

// to test case where url www must be followed with a '.'
//assertFalse("http://wwwe.google.com".matches(regularExpression));

// to test case where www must be preceded with a '.'
assertFalse("https://www@.google.com".matches(regularExpression));
Ashish
  • 490
  • 3
  • 14
  • 12
    you really use `ht{2}ps?` rather then `https?` – Roee Gavirel Jan 28 '13 at 16:25
  • 2
    It should give the same result, but yeah you are right. But I was on an experimental phase of regular expression and wanted to try all its syntax. Thanks for pointing this out. – Ashish Feb 22 '13 at 18:59
  • Can you please help me providing a regex like this one that match query parameters and other path too? like "www.awebsite.com/path?param=value" – thermz Jun 18 '13 at 17:17
2

What's wrong with the plain and simple FILTER_VALIDATE_URL?

 $url = "http://www.example.com";

if(!filter_var($url, FILTER_VALIDATE_URL))
  {
  echo "URL is not valid";
  }
else
  {
  echo "URL is valid";
  }

I know it's not exactly what the question asks, but it did the job for me when I needed to validate URLs, so I thought it might be useful to others who come across this post looking for the same thing.

jojojohn
  • 669
  • 2
  • 10
  • 18
  • 1
    This question is looking for a regexp but you suggest using some filter constant. Do you know how does it searches for links internally? – Kuitsi Jun 19 '13 at 07:45
  • The question is: "What is the best regular expression to check if a string is a valid URL?" sometimes the problem is not to check a String that is supposed to be an URL, sometimes you have a text and you need to read all the URLs in that text, and using REGEX is the only way. Furthermore the OP asks for a solution without specifing a particular language, your solution can be applied only in a specific platform. – thermz Jun 19 '13 at 07:49
2

The following RegEx will work:

"@((((ht)|(f))tp[s]?://)|(www\.))([a-z][-a-z0-9]+\.)?([a-z][-a-z0-9]+\.)?[a-z][-a-z0-9]+\.[a-z]+[/]?[a-z0-9._\/~#&=;%+?-]*@si"
Mohammad Anini
  • 4,622
  • 3
  • 34
  • 43
2

Use this one; it's working for me:

function validUrl(Url) {
    var myRegExp  =/^(?:(?:https?|ftp):\/\/)(?:\S+(?::\S*)?@)?(?:(?!10(?:\.\d{1,3}){3})(?!127(?:\.\d{1,3}){3})(?!169\.254(?:\.\d{1,3}){2})(?!192\.168(?:\.\d{1,3}){2})(?!172\.(?:1[6-9]|2\d|3[0-1])(?:\.\d{1,3}){2})(?:[1-9]\d?|1\d\d|2[01]\d|22[0-3])(?:\.(?:1?\d{1,2}|2[0-4]\d|25[0-5])){2}(?:\.(?:[1-9]\d?|1\d\d|2[0-4]\d|25[0-4]))|(?:(?:[a-z\u00a1-\uffff0-9]+-?)*[a-z\u00a1-\uffff0-9]+)(?:\.(?:[a-z\u00a1-\uffff0-9]+-?)*[a-z\u00a1-\uffff0-9]+)*(?:\.(?:[a-z\u00a1-\uffff]{2,})))(?::\d{2,5})?(?:\/[^\s]*)?$/i;

    if (!myRegExp.test(Url.value)) {
        $("#urlErrorLbl").removeClass('highlightNew');
        return false;
    } 

    $("#urlErrorLbl").addClass('highlightNew'); 
    return true; 
}
Felipe Miosso
  • 7,172
  • 6
  • 40
  • 54
Vinoth K S
  • 88
  • 8
2

You don't specify which language you're using. If it is PHP, there is a native function for that:

$url = 'http://www.yoururl.co.uk/sub1/sub2/?param=1&param2/';

if ( ! filter_var( $url, FILTER_VALIDATE_URL ) ) {
    // Wrong
}
else {
    // Valid
}

Returns the filtered data, or FALSE if the filter fails.

Check it here >>

Hope it helps.

Fredmat
  • 772
  • 4
  • 12
  • 29
2
https?:\/{2}(?:[\/-\w.]|(?:%[\da-fA-F]{2}))+

You can use this pattern for detecting URLs.

Following is the proof of concept

RegExr: URL Detector

Sajeeb Chandan
  • 486
  • 6
  • 9
2

I think some people weren't able to use your PHP code because of the modifiers implied. I copied your code as is and used it as an example:

if(
    preg_match(
        "/^{$IRI_reference}$/iu",
        'http://www.url.com'
    )
){
    echo 'true';
}

Notice the "i" and "u" modifiers. Without "u", PHP emits a warning:

Warning: preg_match() [function.preg-match]: Compilation failed: character value in \x{...} sequence is too large at offset XX
jww
  • 83,594
  • 69
  • 338
  • 732
vortex
  • 853
  • 8
  • 14
1

A regex to check a URL would be:

^http(s{0,1})://[a-zA-Z0-9_/\\-\\.]+\\.([A-Za-z/]{2,5})[a-zA-Z0-9_/\\&\\?\\=\\-\\.\\~\\%]*
Reetika
  • 1,137
  • 1
  • 15
  • 24
1

This is not a regular expression but accomplishes the same thing (Javascript only):

function isAValidUrl(url) {
  try {
    new URL(url);
    return true;
  } catch(e) {
    return false;
  }
}
AndroidDev
  • 20,063
  • 26
  • 131
  • 216
  • The problem with this is that h ttp://bla is a valid URL (the space between h and t is so SO doesn't make it an actual URL) – Ali Habibzadeh Dec 06 '17 at 14:43
1

How about this:

^(https?:\/\/(?:www\.|(?!www))[a-zA-Z0-9][a-zA-Z0-9-]+[a-zA-Z0-9]\.[^\s]{2,}|www\.[a-zA-Z0-9][a-zA-Z0-9-]+[a-zA-Z0-9]\.[^\s]{2,}|https?:\/\/(?:www\.|(?!www))[a-zA-Z0-9]\.[^\s]{2,}|www\.[a-zA-Z0-9]\.[^\s]{2,})$

These are the test cases:

Test cases

You can try it out in here : https://regex101.com/r/mS9gD7/41

tk_
  • 13,042
  • 6
  • 71
  • 81
1

As far as I have found, this expression is good for me-

(https?:\/\/(?:www\.|(?!www))[a-zA-Z0-9][a-zA-Z0-9-]+[a-zA-Z0-9]\.[^\s]{2,}|https?:\/\/(?:www\.|(?!www))[a-zA-Z0-9]\.[^\s]{2,}|www\.[a-zA-Z0-9]\.[^\s]{2,})

Working example-

function RegExForUrlMatch()
{
  var expression = /(https?:\/\/(?:www\.|(?!www))[a-zA-Z0-9][a-zA-Z0-9-]+[a-zA-Z0-9]\.[^\s]{2,}|https?:\/\/(?:www\.|(?!www))[a-zA-Z0-9]\.[^\s]{2,}|www\.[a-zA-Z0-9]\.[^\s]{2,})/g;

  var regex = new RegExp(expression);
  var t = document.getElementById("url").value;

  if (t.match(regex)) {
    document.getElementById("demo").innerHTML = "Successful match";
  } else {
    document.getElementById("demo").innerHTML = "No match";
  }
}
<input type="text" id="url" placeholder="url" onkeyup="RegExForUrlMatch()">

<p id="demo">Please enter a URL to test</p>
0

This is a rather old thread now, and the question asks for a regex-based URL validator. I ran into the thread whilst looking for precisely the same thing. While it may well be possible to write a really comprehensive regex to validate URLs, I eventually settled on another way to do things: using PHP's parse_url function.

It returns boolean false if the URL cannot be parsed. Otherwise, it returns the scheme, the host and other information. This may well not be enough for a comprehensive URL check on its own, but the result can be drilled down into for further analysis. If the intent is simply to catch typos, invalid schemes, etc., it is perfectly adequate!
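The same parse-first idea can be sketched in Python with the standard library's `urllib.parse.urlsplit`. This is a loose analogue rather than an equivalent of `parse_url`: the helper name `looks_like_url` and the `ALLOWED_SCHEMES` set are assumptions made for this example, and the checks should be tightened or loosened to fit the use case.

```python
from urllib.parse import urlsplit

# Assumption for this sketch: only these schemes are acceptable.
ALLOWED_SCHEMES = {"http", "https", "ftp"}


def looks_like_url(candidate: str) -> bool:
    """Parse first, then drill down into the parts."""
    try:
        parts = urlsplit(candidate)
        parts.port  # reading .port raises ValueError for a bad port
    except ValueError:
        return False
    # Require a known scheme and a non-empty host.
    return parts.scheme in ALLOWED_SCHEMES and bool(parts.hostname)


print(looks_like_url("http://example.com/path?x=1"))
print(looks_like_url("htp://example.com"))
```

Like the PHP approach, this catches typos and unexpected schemes but makes no claim about whether the host actually exists.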

DroidOS
  • 7,220
  • 11
  • 70
  • 142
0

Here is the best and the most matched regex for this situation

^(?:http(?:s)?:\/\/)?(?:www\.)?(?:[\w-]*)\.\w{2,}$
M.R.Safari
  • 1,644
  • 3
  • 26
  • 38
0

To match the URL up to the domain:

(^(\bhttp)(|s):\/{2})(?=[a-z0-9-_]{1,255})\.\1\.([a-z]{3,7}$)

It can be simplified to:

(^(\bhttp)(|s):\/{2})(?=[a-z0-9-_.]{1,255})\.([a-z]{3,7})

The latter does not check for the end of the line, so it can later be extended to match full-blown URLs with full paths and query strings.

runlevel0
  • 1,933
  • 2
  • 19
  • 26
0

This should work:

function validateUrl(value){
 return /^(http(s)?:\/\/.)?(www\.)?[-a-zA-Z0-9@:%._\+~#=]{2,256}\.[a-z]{2,6}\b([-a-zA-Z0-9@:%_\+.~#?&//=]*)$/gi.test(value);
}

console.log(validateUrl('google.com')); // true
console.log(validateUrl('www.google.com')); // true
console.log(validateUrl('http://www.google.com')); // true
console.log(validateUrl('http:/www.google.com')); // false
console.log(validateUrl('www.google.com/test')); // true
Daniel Mihai
  • 147
  • 1
  • 6
0

I think I found a more general regexp to validate urls, particularly websites

​(https?:\/\/)?(www\.)[-a-zA-Z0-9@:%._\+~#=]{2,256}\.[a-z]{2,4}\b([-a-zA-Z0-9@:%_\+.~#?&//=]*)|(https?:\/\/)?(www\.)?(?!ww)[-a-zA-Z0-9@:%._\+~#=]{2,256}\.[a-z]{2,4}\b([-a-zA-Z0-9@:%_\+.~#?&//=]*)

It does not allow, for instance, www.something, http://www or http://www.something.

Check it here: http://regexr.com/3e4a2

MithPaul
  • 119
  • 1
  • 14
0

I created a regex (PCRE) similar to the one @eyelidlessness provided, following RFC3987 along with other RFC documents. The major differences between @eyelidlessness's regex and mine are readability and URN support.

The regex below is all one piece (instead of being mixed with PHP), so it can be used in different languages very easily (so long as they support PCRE).

The easiest way to test this regex is to use regex101 and copy paste the code and test strings below with the appropriate modifiers (gmx).

To use this regex in PHP, insert the regex below into the following code:

$regex = <<<'EOD'
// Put the regex here
EOD;


To match a link without a scheme (e.g. john.doe@gmail.com or www.google.com/pathtofile.php?query), replace this section:
  (?:
    (?<scheme>
      (?<urn>urn)|
      (?&d_scheme)
    )
    :
  )

with this:

  (?:
    (?<scheme>
      (?<urn>urn)|
      (?&d_scheme)
    )
    :
  )?

Note, however, that by replacing this, the regex does not become 100% reliable.


Regex (PCRE) with gmx modifiers for the multi-line test string below
(?(DEFINE)
  # Definitions
  (?<ALPHA>[\p{L}])
  (?<DIGIT>[0-9])
  (?<HEX>[0-9a-fA-F])
  (?<NCCHAR>
    (?&UNRESERVED)|
    (?&PCT_ENCODED)|
    (?&SUB_DELIMS)|
    @
  )
  (?<PCHAR>
    (?&UNRESERVED)|
    (?&PCT_ENCODED)|
    (?&SUB_DELIMS)|
    :|
    @|
    \/
  )
  (?<UCHAR>
    (?&UNRESERVED)|
    (?&PCT_ENCODED)|
    (?&SUB_DELIMS)|
    :
  )
  (?<RCHAR>
    (?&UNRESERVED)|
    (?&PCT_ENCODED)|
    (?&SUB_DELIMS)
  )
  (?<PCT_ENCODED>%(?&HEX){2})
  (?<UNRESERVED>
    ((?&ALPHA)|(?&DIGIT)|[-._~])
  )
  (?<RESERVED>(?&GEN_DELIMS)|(?&SUB_DELIMS))
  (?<GEN_DELIMS>[:\/?#\[\]@])
  (?<SUB_DELIMS>[!$&'()*+,;=])
  # URI Parts
  (?<d_scheme>
    (?!urn)
    (?:
      (?&ALPHA)
      ((?&ALPHA)|(?&DIGIT)|[+-.])*
      (?=:)
    )
  )
  (?<d_hier_part_slashes>
    (\/{2})?
  )
  (?<d_authority>(?&d_userinfo)?)
  (?<d_userinfo>(?&UCHAR)*)
  (?<d_ipv6>
    (?![^:]*::[^:]*::[^:]*)
    (
      (
        ((?&HEX){0,4})
        :
      ){1,7}
      ((?&d_ipv4)|:|(?&HEX){1,4})
    )
  )
  (?<d_ipv4>
    ((?&octet)\.){3}
    (?&octet)
  )
  (?<octet>
    (
      25[0-5]|
      2[0-4](?&DIGIT)|
      1(?&DIGIT){2}|
      [1-9](?&DIGIT)|
      (?&DIGIT)
    )
  )
  (?<d_reg_name>(?&RCHAR)*)
  (?<d_urn_name>(?&UCHAR)*)
  (?<d_port>(?&DIGIT)*)
  (?<d_path>
    (
      \/
      ((?&PCHAR)*)*
      (?=\?|\#|$)
    )
  )
  (?<d_query>
    (
      ((?&PCHAR)|\/|\?)*
    )?
  )
  (?<d_fragment>
    (
      ((?&PCHAR)|\/|\?)*
    )?
  )
)
^
(?<link>
  (?:
    (?<scheme>
      (?<urn>urn)|
      (?&d_scheme)
    )
    :
  )
  (?(urn)
    (?:
      (?<namespace_identifier>[0-9a-zA-Z\-]+)
      :
      (?<namespace_specific_string>(?&d_urn_name)+)
    )
    |
    (?<hier_part>
      (?<slashes>(?&d_hier_part_slashes))
      (?<authority>
        (?:
          (?<userinfo>(?&d_authority))
          @
        )?
        (?<host>
          (?<ipv4>\[?(?&d_ipv4)\]?)|
          (?<ipv6>\[(?&d_ipv6)\])|
          (?<domain>(?&d_reg_name))
        )
        (?:
          :
          (?<port>(?&d_port))
        )?
      )
      (?<path>(?&d_path))?
    )
    (?:
      \?
      (?<query>(?&d_query))
    )?
    (?:
      \#
      (?<fragment>(?&d_fragment))
    )?
  )
)
$

Test Strings

# Valid URIs
ftp://cnn.example.com&story=breaking_news@10.0.0.1/top_story.htm
ftp://ftp.is.co.za/rfc/rfc1808.txt
http://www.ietf.org/rfc/rfc2396.txt
ldap://[2001:db8::7]/c=GB?objectClass?one
mailto:John.Doe@example.com
news:comp.infosystems.www.servers.unix
tel:+1-816-555-1212
telnet://192.0.2.16:80/
urn:isbn:0451450523
urn:oid:2.16.840
urn:isan:0000-0000-9E59-0000-O-0000-0000-2
urn:oasis:names:specification:docbook:dtd:xml:4.1.2
http://localhost/test/somefile.php?query=someval&variable=value#fragment
http://[2001:db8:a0b:12f0::1]/test
ftp://username:password@domain.com/path/to/file/somefile.html?queryVariable=value#fragment
https://subdomain.domain.com/path/to/file.php?query=value#fragment
https://subdomain.example.com/path/to/file.php?query=value#fragment
mailto:john.smith(comment)@example.com
mailto:user@[2001:DB8::1]
mailto:user@[255:192:168:1]
mailto:M.Handley@cs.ucl.ac.uk
http://localhost:4433/path/to/file?query#fragment
# Note that the example below IS valid, as it does follow RFC standards
localhost:4433/path/to/file

# These work with the optional scheme group although I'd suggest making the scheme mandatory as misinterpretations can occur
john.doe@gmail.com
www.google.com/pathtofile.php?query
[192a:123::192.168.1.1]:80/path/to/file.html?query#fragment
ctwheels
  • 19,377
  • 6
  • 29
  • 60
0

The expression below will work for all popular domains. It will accept the following URLs:

In addition, it will turn a URL inside a message into a link,
e.g. "please visit yourwebsite.com".
In the example above, yourwebsite.com becomes a hyperlink.

if (new RegExp("([-a-z0-9]{1,63}\\.)*?[a-z0-9][-a-z0-9]{0,61}[a-z0-9]\\.(com|com/|org|gov|cm|net|online|live|biz|us|uk|co.us|co.uk|in|co.in|int|info|edu|mil|ca|co|co.au|org/|gov/|cm/|net/|online/|live/|biz/|us/|uk/|co.us/|co.uk/|in/|co.in/|int/|info/|edu/|mil/|ca/|co/|co.au/)(/[-\\w@\\+\\.~#\\?*&/=% ]*)?$").test(strMessage) || (new RegExp("^[a-z ]+[\.]?[a-z ]+?[\.]+[a-z ]+?[\.]+[a-z ]+?[-\\w@\\+\\.~#\\?*&/=% ]*").test(strMessage) && new RegExp("([a-zA-Z0-9]+://)?([a-zA-Z0-9_]+:[a-zA-Z0-9_]+@)?([a-zA-Z0-9.-]+\\.[A-Za-z]{2,4})(:[0-9]+)?(/.*)?").test(strMessage)) || (new RegExp("^[a-z ]+[\.]?[a-z ]+?[-\\w@\\+\\.~#\\?*&/=% ]*").test(strMessage) && new RegExp("([a-zA-Z0-9]+://)?([a-zA-Z0-9_]+:[a-zA-Z0-9_]+@)?([a-zA-Z0-9.-]+\\.[A-Za-z]{2,4})(:[0-9]+)?(/.*)?").test(strMessage))) {
  if (new RegExp("^[a-z ]+[\.]?[a-z ]+?[\.]+[a-z ]+?[\.]+[a-z ]+?$").test(strMessage) && new RegExp("([a-zA-Z0-9]+://)?([a-zA-Z0-9_]+:[a-zA-Z0-9_]+@)?([a-zA-Z0-9.-]+\\.[A-Za-z]{2,4})(:[0-9]+)?(/.*)?").test(strMessage)) {
    var url1 = /(^|&lt;|\s)([\w\.]+\.(?:com|org|gov|cm|net|online|live|biz|us|uk|co.us|co.uk|in|co.in|int|info|edu|mil|ca|co|co.au))(\s|&gt;|$)/g;
    var html = $.trim(strMessage);
    if (html) {
      html = html.replace(url1, '$1<a style="color:blue; text-decoration:underline;" target="_blank"  href="http://$2">$2</a>$3');
    }
    returnString = html;
    return returnString;
  } else {
    var url1 = /(^|&lt;|\s)(www\..+?\.(?:com|org|gov|cm|net|online|live|biz|us|uk|co.us|co.uk|in|co.in|int|info|edu|mil|ca|co|co.au)[^,\s]*)(\s|&gt;|$)/g,
      url2 = /(^|&lt;|\s)(((https?|ftp):\/\/|mailto:).+?\.(?:com|org|gov|cm|net|online|live|biz|us|uk|co.us|co.uk|in|co.in|int|info|edu|mil|ca|co|co.au)[^,\s]*)(\s|&gt;|$)/g,
      url3 = /(^|&lt;|\s)([\w\.]+\.(?:com|org|gov|cm|net|online|live|biz|us|uk|co.us|co.uk|in|co.in|int|info|edu|mil|ca|co|co.au)[^,\s]*)(\s|&gt;|$)/g;

    var html = $.trim(strMessage);
    if (html) {
      html = html.replace(url1, '$1<a style="color:blue; text-decoration:underline;" target="_blank"  href="http://$2">$2</a>$3').replace(url2, '$1<a style="color:blue; text-decoration:underline;" target="_blank"  href="$2">$2</a>$5').replace(url3, '$1<a style="color:blue; text-decoration:underline;" target="_blank"  href="http://$2">$2</a>$3');
    }
    returnString = html;

    return returnString;
  }
}
awran5
  • 3,433
  • 2
  • 10
  • 26
Ravi Matani
  • 794
  • 1
  • 7
  • 21
0

After rigorous searching I finally settled on the following:

^[a-zA-Z0-9]+\:\/\/[a-zA-Z0-9]+\.[-a-zA-Z0-9]+\.?[a-zA-Z0-9]+$|^[a-zA-Z0-9]+\.[-a-zA-Z0-9]+\.[a-zA-Z0-9]+$

And this works for URLs in general, including future ones.

dev_khan
  • 592
  • 6
  • 14
0

The best regex I've found is: /(^|\s)((https?:\/\/)?[\w-]+(\.[\w-]+)+\.?(:\d+)?(\/\S*)?)/gi

For iOS Swift: (^|\\s)((https?:\\/\\/)?[\\w-]+(\\.[\\w-]+)+\\.?(:\\d+)?(\\/\\S*)?)

http://jsfiddle.net/9BYdp/1/

Found here

Nik Kov
  • 10,605
  • 4
  • 56
  • 96
0

Interestingly, none of the answers above worked for what I needed, so I figured I would offer my solution. I needed to be able to do the following:

  • Match http(s)://www.google.com, http://google.com, www.google.com, and google.com
  • Match Github markdown style links like [Google](http://www.google.com)
  • Match all possible domain extensions, like .com, or .io, or .guru, etc. Basically anything between 2-6 characters in length
  • Split everything into proper groupings so that I could access each part as needed.

Here was the solution:

/^(\[[A-z0-9 _]*\]\()?((?:(http|https):\/\/)?(?:[\w-]+\.)+[a-z]{2,6})(\))?$

This gives me all of the above requirements. You could optionally add the ability for ftp and file if necessary:

/^(\[[A-z0-9 _]*\]\()?((?:(http|https|ftp|file):\/\/)?(?:[\w-]+\.)+[a-z]{2,6})(\))?$
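To show how the groupings come out in practice, here is a short Python sketch using the first pattern (without the leading `/` delimiter). The names `LINK_RE` and `parse_link` and the dictionary keys are made up for this example.

```python
import re

LINK_RE = re.compile(
    r"^(\[[A-z0-9 _]*\]\()?"                            # optional "[label]("
    r"((?:(http|https):\/\/)?(?:[\w-]+\.)+[a-z]{2,6})"  # URL, scheme optional
    r"(\))?$"                                           # optional closing ")"
)


def parse_link(text: str):
    """Return the captured parts as a dict, or None if nothing matches."""
    m = LINK_RE.match(text)
    if m is None:
        return None
    return {
        "markdown_prefix": m.group(1),  # e.g. "[Google](" or None
        "url": m.group(2),
        "scheme": m.group(3),           # "http", "https" or None
        "closing_paren": m.group(4),
    }


print(parse_link("[Google](http://www.google.com)"))
print(parse_link("google.com"))
```

Because the opening and closing parts are independently optional, the pattern does not by itself guarantee that a Markdown link is balanced.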
Erick Maynard
  • 631
  • 6
  • 16
0

I think this is a very simple way, and it works very well.

var hasURL = (str) =>{
 var url_pattern = new RegExp("(www.|http://|https://|ftp://)\\w*");
 if(!url_pattern.test(str)){
  document.getElementById("demo").innerHTML = 'No URL';
 }
 else
  document.getElementById("demo").innerHTML = 'String has a URL';
};
<p>Please enter a string and test it has any url or not</p>
<input type="text" id="url" placeholder="url" onkeyup="hasURL(document.getElementById('url').value)">
<p id="demo"></p>
Mahfuzur Rahman
  • 1,272
  • 13
  • 21
  • Your regex doesn't work at all bro. All it validates is that your string contains either `www` immediately followed by **one** character (any character since you haven't escaped the `.`) or `http://` or `https://` or `ftp://` and any of these **can** be followed by any alphanumeric characters. So, in other words, all the following strings would result as being valid but they are obviously not valid urls : `www.`, `www▓`, `£¢¤£¢¤www¢` (See on [regex101](https://regex101.com/r/WOAt0M/2/tests)). You could have used a shorter regex: `(www.|(https?|ftp)://)\w*`. (This is still not a good regex btw) – Elie G. Dec 10 '18 at 04:40
  • Obviously `www.`, `www▓`, and `£¢¤£¢¤www¢` are not valid URLs, but I don't think they are meaningful strings either. I was just trying to simplify the URL pattern. @DrunkenPoney – Mahfuzur Rahman Dec 10 '18 at 05:34
  • My goal wasn't to write *meaningful* strings but to show that weird strings would be accepted. Anyway, since your regex *validates* for `www`, I suppose you don't necessarily need the protocol to be specified, yet your regex wouldn't allow URLs like `google.com`. Moreover, one of the problems I was trying to show you is that your regex matches wherever the *validation parts* (`www`, `http`, ...) are in the string. You could at least specify that your string needs to start with one of them. – Elie G. Dec 10 '18 at 16:35
  • And if you want a quick regex to validate a URL that is not 100% safe, [here](https://regex101.com/r/Q2ilqN/7) is one I made. I used it to extract the different parts of a URL, but it can also be used to validate that a string contains the base parts of a URL. – Elie G. Dec 10 '18 at 16:38
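
To make the points raised in these comments concrete, here is a small sketch that can be run in Node. `loose` is the pattern as the comments read it; `tighter` is only a hypothetical variant (escaped dot, anchored at the start, at least one word character) and is still far from a real URL validator.

```javascript
// The pattern as discussed in the comments (a regex literal avoids string-escaping issues).
const loose = /(www.|http:\/\/|https:\/\/|ftp:\/\/)\w*/;
// Hypothetical tightened variant, for comparison only.
const tighter = /^(www\.|(https?|ftp):\/\/)\w+/;

const samples = ["www.", "www▓", "£¢¤£¢¤www¢", "google.com", "http://example.com"];

// Record what each pattern says about each sample string.
const results = samples.map((s) => ({
  input: s,
  loose: loose.test(s),
  tighter: tighter.test(s),
}));

console.table(results);
```

The first three samples pass the loose pattern because the dot is unescaped and nothing anchors the match; neither pattern accepts a bare `google.com`.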
0

IMPROVED

Detects URLs like these:

Regex:

/^(?:http(s)?:\/\/)?[\w.-]+(?:\.[\w\.-]+)+[\w\-\._~:/?#[\]@!\$&'\(\)\*\+,;=.]+$/gm
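
A minimal sketch of how the `g` and `m` flags behave with this pattern in JavaScript; the sample text and variable names are made up for illustration. With `m`, `^` and `$` apply per line, so `String.prototype.match` returns every line that looks like a URL. One thing to watch for is that a regex with the `g` flag keeps state in `lastIndex`, so repeated `test()` calls on the same object can alternate between results.

```javascript
const urlPattern = /^(?:http(s)?:\/\/)?[\w.-]+(?:\.[\w\.-]+)+[\w\-\._~:/?#[\]@!\$&'\(\)\*\+,;=.]+$/gm;

// Three lines of sample input; only the first and last look like URLs.
const text = ["https://example.com/path?x=1", "not a url", "www.example.org"].join("\n");

// With the g flag, match() returns all full matches (or null when there are none).
const found = text.match(urlPattern) ?? [];
console.log(found);

// Because of the g flag, test() resumes from lastIndex on the next call.
const first = urlPattern.test("www.example.org");  // starts at lastIndex 0
const second = urlPattern.test("www.example.org"); // resumes where the first call stopped
console.log(first, second);
```

If the pattern is only used to validate a single value, dropping the `g` flag avoids the `lastIndex` surprise.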
0

If you would like to apply a stricter rule, here is what I have developed:

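function isValidUrl(input) {
    // Optional case-insensitive http(s) scheme, then a host with a 2-10 letter extension and an optional path.
    var regex = /^(((H|h)(T|t)(T|t)(P|p)(S|s)?):\/\/)?[-a-zA-Z0-9@:%._\+~#=]{2,100}\.[a-zA-Z]{2,10}(\/([-a-zA-Z0-9@:%_\+.~#?&//=]*))?/;
    return regex.test(input);
}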
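
A few sample calls, as a sketch. The pattern is copied from the function above so the snippet is self-contained. Since it has no `$` at the end, anything after a valid-looking prefix is not checked; `anchoredPattern` is a hypothetical variant that appends `$`, shown only for comparison.

```javascript
// Same pattern as in the answer above.
const strictPattern = /^(((H|h)(T|t)(T|t)(P|p)(S|s)?):\/\/)?[-a-zA-Z0-9@:%._\+~#=]{2,100}\.[a-zA-Z]{2,10}(\/([-a-zA-Z0-9@:%_\+.~#?&//=]*))?/;
// Hypothetical variant: same source with "$" appended so the whole input must match.
const anchoredPattern = new RegExp(strictPattern.source + "$");

const inputs = ["https://example.com/path", "example.com <script>", "localhost"];

// Print each input with the verdict of both patterns.
for (const input of inputs) {
  console.log(input, strictPattern.test(input), anchoredPattern.test(input));
}
```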
Kerem
0

Regardless of the broad question asked, I'm posting this for anyone in the future who is looking for something simple. I don't think there is a perfect regular expression for validating a URL that fits all needs; it depends on your requirements. In my case, I just needed to verify that a URL is in the form domain.extension, while allowing www or any other subdomain, as in blog.domain.extension. I don't care about http(s), because my app has a field that says "enter the URL", so it's obvious what the entered string is.

So here is the regex:

/^(www\.|[a-zA-Z0-9](.*[a-zA-Z0-9])?\.)?((?!www)[a-zA-Z0-9][a-zA-Z0-9-]+[a-zA-Z0-9])\.[a-z]{2,5}(:[0-9]{1,5})?$/i

The first block in this regex is:

(www\.|[a-zA-Z0-9](.*[a-zA-Z0-9])?\.)? ---> this checks whether the URL starts with www. or with [a-zA-Z0-9](.*[a-zA-Z0-9])?\. which means a letter or digit, then (any character zero or more times, then another letter or digit), followed by a dot.

Note that (.*[a-zA-Z0-9])?, which we read as (any character zero or more times, then another letter or digit), is optional; that's why it is grouped in parentheses and followed by the question mark ?

The whole block discussed so far is also wrapped in parentheses and followed by ?, which means that both www and any other word representing a subdomain are optional.

The second part is: ((?!www)[a-zA-Z0-9][a-zA-Z0-9-]+[a-zA-Z0-9])\. ---> this represents the "domain" part. It can be any word that does not start with www: it begins with a letter or digit, continues with one or more letters, digits, or dashes "-", ends with a letter or digit, and is followed by a dot.

The final part is [a-z]{2,5} ---> this represents the "extension": 2 to 5 letters, so it can be com, net, org, art, and so on. It may be followed by an optional port, (:[0-9]{1,5})?, such as :8080.
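
To see the three parts working together, here is a small sketch that runs the regex against a handful of made-up inputs. The variable names are only for illustration.

```javascript
// The regex from this answer, unchanged.
const domainPattern = /^(www\.|[a-zA-Z0-9](.*[a-zA-Z0-9])?\.)?((?!www)[a-zA-Z0-9][a-zA-Z0-9-]+[a-zA-Z0-9])\.[a-z]{2,5}(:[0-9]{1,5})?$/i;

const candidates = [
  "example.com",        // plain domain.extension
  "blog.example.com",   // with a subdomain
  "www.example.com",    // with www
  "example.com:8080",   // with a port
  "http://example.com", // a scheme is not part of the pattern
  "www.com",            // the domain part may not start with www
];

// Map each candidate to whether the pattern accepts it.
const verdicts = candidates.map((c) => [c, domainPattern.test(c)]);
console.table(verdicts);
```

The first four inputs are accepted and the last two are rejected, which matches the intent described above.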

medBouzid
0

A simple check for a URL is:

^(ftp|http|https):\/\/[^ "]+$
-1
/^(http|HTTP)+(s|S)?:\/\/[\w.-]+(?:\.[\w\.-]+)+[\w\-\._\$\(\)/]+$/g

Check the demo with tests:

https://regexr.com/5cedu

Wai Ha Lee
manmeet
-1

The following Regex works for me:

(http(s)?:\/\/.)?(ftp(s)?:\/\/.)?(www\.)?[-a-zA-Z0-9@:%._\+~#=]{0,256}\.[a-z]{2,6}\b([-a-zA-Z0-9@:%_\+.~#?&//=]*)

matches:

https://google.com
t.me
https://t.me
ftp://google.com
http://sm.tj
http://bro.tj
t.me/rshss
https:google.com
www.cool.com.au
http://www.cool.com.au
http://www.cool.com.au/ersdfs
http://www.cool.com.au/ersdfs?dfd=dfgd@s=1
http://www.cool.com:81/index.html
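
A quick way to re-check the list above in JavaScript, as a sketch (the variable names are made up). The regex is not anchored with `^` and `$`, so `test()` reports whether a URL-like substring occurs anywhere in the input; the last two lines show what that means for other strings.

```javascript
// The regex from this answer, written on a single line.
const loosePattern = /(http(s)?:\/\/.)?(ftp(s)?:\/\/.)?(www\.)?[-a-zA-Z0-9@:%._\+~#=]{0,256}\.[a-z]{2,6}\b([-a-zA-Z0-9@:%_\+.~#?&//=]*)/;

const listed = [
  "https://google.com", "t.me", "https://t.me", "ftp://google.com",
  "http://sm.tj", "http://bro.tj", "t.me/rshss", "https:google.com",
  "www.cool.com.au", "http://www.cool.com.au", "http://www.cool.com.au/ersdfs",
  "http://www.cool.com.au/ersdfs?dfd=dfgd@s=1", "http://www.cool.com:81/index.html",
];

// True when every listed sample contains a match.
const allMatch = listed.every((s) => loosePattern.test(s));
console.log("all listed samples match:", allMatch);

// No dot followed by letters, so nothing matches here.
console.log(loosePattern.test("localhost"));
// Unanchored: a URL-like substring inside a sentence is enough.
console.log(loosePattern.test("see foo.bar now"));
```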
Robson