How can I write a regex that differentiates between a) a top-level URL and b) the links inside that top-level URL?

For example, if the top-level URL is http://www.example.com/

and the other links inside this top folder can be,
http://www.example.com/go
http://www.example.com/contact/
http://www.example.com/links/

I do not know in advance which links are inside the top folder. Is there a regex that can select the main URL and also all of the subfolders inside it?

Thanks.

dotnet user

2 Answers

I would suggest starting with a regex that breaks a URL into its components. There are many examples. This one is taken from Jan Goyvaerts, co-author of Regular Expressions Cookbook:

(?i)\b(?<protocol>https?|ftp)://(?<domain>[-A-Z0-9.]+)(?<file>/[-A-Z0-9+&@#/%=~_|!:,.;]*)?(?<parameters>\?[A-Z0-9+&@#/%=~_|!:,.;]*)?
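As a rough sketch of how the groups could be read in code, here is the same pattern in Python. Two things are my adaptation and not part of the original regex: Python writes named groups as (?P<name>...), and the (?i) flag is passed as re.IGNORECASE. The helper name split_url is made up for illustration.

```python
import re

# Python spells named groups (?P<name>...); the pattern is otherwise unchanged.
URL_PARTS = re.compile(
    r"\b(?P<protocol>https?|ftp)://"
    r"(?P<domain>[-A-Z0-9.]+)"
    r"(?P<file>/[-A-Z0-9+&@#/%=~_|!:,.;]*)?"
    r"(?P<parameters>\?[A-Z0-9+&@#/%=~_|!:,.;]*)?",
    re.IGNORECASE,  # stands in for the inline (?i)
)


def split_url(url):
    """Return a dict of the named URL components, or None if nothing matches."""
    match = URL_PARTS.search(url)
    return match.groupdict() if match else None


parts = split_url("http://www.example.com/contact/")
print(parts["domain"])  # www.example.com
print(parts["file"])    # /contact/
```

A URL with no path, such as http://www.example.com, leaves the file group as None, which is one way to tell the top-level URL from its sub-links.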

The different segments of the URL are available in the named capture groups (in the DEMO, look at the Groups in the right pane).

Then, if you want to match fewer components, shorten the regex:

(?im)^\b(?<protocol>https?|ftp)://(?<domain>[-A-Z0-9.]+)/?$

See in the second demo how this one matches only the top-level URL (protocol and domain, with an optional trailing slash) and rejects URLs that have a path after it.
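A minimal sketch of using the shortened pattern to sort URLs into "top level" and "sub link", again in Python. The (?P<name>...) spelling, the flags passed to re.compile, and the helper name is_top_level are my assumptions, not part of the original pattern.

```python
import re

# Flags replace the inline (?im); named groups use Python's (?P<name>...) syntax.
TOP_LEVEL = re.compile(
    r"^\b(?P<protocol>https?|ftp)://(?P<domain>[-A-Z0-9.]+)/?$",
    re.IGNORECASE | re.MULTILINE,
)


def is_top_level(url):
    """True when the URL has no path beyond an optional trailing slash."""
    return TOP_LEVEL.match(url) is not None


for url in ("http://www.example.com/", "http://www.example.com/go"):
    print(url, "->", "top level" if is_top_level(url) else "sub link")
```

Anything that fails this check but still starts with the same protocol and domain can be treated as a link inside the top-level URL.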

zx81

Since you don't need to validate the URL, simply read the matched groups: group 1 is the host of the top-level URL and group 2 is anything that follows it. Each group is captured by enclosing that part of the pattern in parentheses (...).

^http:\/\/([^\/]*)\/(.*)$

Here is a DEMO; click the code generator link there to get the code in your desired language.

Pattern explanation:

  ^                        the beginning of the string
  http:                    'http:'
  \/                       '/'
  \/                       '/'
  (                        group and capture to \1:
    [^\/]*                   any character except: '\/' (0 or more times (Greedy))
  )                        end of \1
  \/                       '/'
  (                        group and capture to \2:
    .*                       any character except \n (0 or more times (Greedy))
  )                        end of \2
  $                        before an optional \n, and the end of the string
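To make the two groups concrete, here is a small Python sketch of the pattern explained above. The helper name host_and_rest is hypothetical, and the escaped slashes are dropped because Python does not need them.

```python
import re

# Same pattern as above; "/" needs no escaping in a Python regex.
HOST_AND_REST = re.compile(r"^http://([^/]*)/(.*)$")


def host_and_rest(url):
    """Return (group 1, group 2) for a matching URL, or None otherwise."""
    match = HOST_AND_REST.match(url)
    return match.groups() if match else None


print(host_and_rest("http://www.example.com/"))          # ('www.example.com', '')
print(host_and_rest("http://www.example.com/contact/"))  # ('www.example.com', 'contact/')
```

An empty group 2 means the URL is the top-level one. Note that the pattern requires the slash after the host, so http://www.example.com without a trailing slash does not match at all.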

If the URLs are embedded in a longer string or spread over multiple lines, use the regex below instead:

\bhttp:\/\/([^\/]*)\/([^\s]*)

DEMO
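A short Python sketch of scanning a block of text with this second pattern. The sample text and the helper name find_urls are made up for illustration.

```python
import re

# Unanchored variant: group 2 stops at the first whitespace character.
URL_IN_TEXT = re.compile(r"\bhttp://([^/]*)/([^\s]*)")


def find_urls(text):
    """Return a list of (host, rest) tuples for every URL found in text."""
    return URL_IN_TEXT.findall(text)


sample = """See http://www.example.com/ first,
then http://www.example.com/go and http://www.example.com/links/ too."""

for host, rest in find_urls(sample):
    print(host, repr(rest))
```

One caveat: group 1 is "anything up to the next slash", so a URL written without any slash after the host can swallow the text that follows it. Tightening group 1 to something like [^/\s]* would be one way around that if it matters for your input.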

Braj