I'm working on a regex for validating URLs in C#. The regex must not match when another http:// appears in the URL after the first one. This was my first try:

(https?:\/\/.+?)\/(.+?)(?!https?:\/\/)

But this regex does not work (even after removing (?!https?:\/\/)). Take for example this input string:

http://test.test/notwork.http://test

Here is my first doubt: why doesn't the capturing group (.+?) match notwork.http://test? The lazy quantifier should match as few times as possible, but why doesn't it continue to the end? I was clearly missing something (at first I thought it could be related to backtracking, but I don't think that is the case), so I read this and found a solution, even if I'm not sure it is the best one, since it says that

This technique presents no advantage over the lazy dot-star

Anyway, that solution is the tempered dot. This is my next try:

(https?:\/\/.+?)\/((?:(?!https?:\/\/).)*)

Now this regex works, but not in the way I would like: I need a match only when the URL is valid.

By the way, I don't think I have fully understood what the new regex is doing: why does the negative lookahead go before the . and not after it? I tried moving it after the ., and it seems to match the URL up to the second-to-last character before the second http. Returning to the corrected regex, my hypothesis is that the negative lookahead is actually checking what comes after the . already read by the regex. Is this right?

Other solutions are welcome, but I'd first like to understand this one. Thank you.

Marco Luzzara
  • The question is too broad. The second "doubt" is explained [here](https://stackoverflow.com/questions/30900794/tempered-greedy-token-what-is-different-about-placing-the-dot-before-the-negat). As for the first one, you just needed to use a *positive* lookahead with a `$` as an alternative (`(.*?)(?=https?:\/\/|$)`). A `.+?` matches 1 char, and does not have to match more since it is lazy. – Wiktor Stribiżew Aug 11 '17 at 13:50
  • What do you mean by "I need a match only when the url is valid"? – Wiktor Stribiżew Aug 11 '17 at 14:03
  • About the first doubt: should I use `$` so the lazy quantifier can match until the end of input, right? Why is it not implied? I read your answer about the *tempered greedy token* and it's definitely more clear. I need a match only when the url does not contain other `http://`, whereas with my current regex I have a match when `http://` is included too. By the way, thank you for the answers. – Marco Luzzara Aug 11 '17 at 14:19
  • Looks like you want something like [`(?>https?://\S+?/(?:(?!https?://).)*)(?!https?://)`](https://regex101.com/r/Uuf86f/2). – Wiktor Stribiżew Aug 11 '17 at 14:27
  • You hit the spot. Thank you again. – Marco Luzzara Aug 11 '17 at 14:44

1 Answer

The solution you seek is

(?>https?://\S+?/(?:(?!https?://).)*)(?!https?://)

See the regex demo (https://regex101.com/r/Uuf86f/2).

Details

  • (?>https?://\S+?/(?:(?!https?://).)*) - an atomic group (allowing no backtracking into its subpatterns) that matches
    • https?:// - http:// or https://
    • \S+? - any 1 or more non-whitespace chars, as few as possible, up to the first...
    • / - / symbol followed by...
    • (?:(?!https?://).)* - zero or more chars (as many as possible) that do not start a sequence of http:// or https:// chars.
  • (?!https?://) - a negative lookahead failing the match if there is http:// or https:// immediately to the right of the current location.

The (https?:\/\/.+?)\/(.+?)(?!https?:\/\/) pattern does not work because .+? matches lazily, i.e. it grabs the first char it finds, then lets the subsequent subpattern match. That subpattern is a negative lookahead, which fails the match only if there is http:// or https:// immediately to the right of the current location. As there is no such substring after n in http://test.test/notwork.http://test, the lookahead succeeds and the match ending with n is returned. If you do not tell the regex engine to match more, or up to some other delimiter/pattern, it won't.
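
The lazy behaviour can be reproduced with a short Python sketch (Python is only a convenient stand-in for C# here; the constructs used behave the same way in both engines). The second input string and the variable names are made up for illustration, and the third pattern is the positive-lookahead variant suggested in the comments under the question.

```python
import re

text = "http://test.test/notwork.http://test"

# The asker's first attempt: the lazy (.+?) stops after one character,
# because the negative lookahead already succeeds right after "n".
first_try = re.search(r"(https?://.+?)/(.+?)(?!https?://)", text)

# If that single character is directly followed by "http://", the lookahead
# fails once and the lazy group simply grows by one more character.
shifted = re.search(r"(https?://.+?)/(.+?)(?!https?://)", "http://a.b/xhttp://c")

# Variant from the comments: a positive lookahead gives the lazy group
# something to stop at (the next "http(s)://" or the end of the input).
delimited = re.search(r"(https?://.+?)/(.*?)(?=https?://|$)", text)

print(first_try.group(0))
print(shifted.group(2))
print(delimited.group(2))
```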

The tempered greedy token solution has been talked over a lot. The exact doubt as to where to place the lookahead is covered in this answer.
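
For a quick feel of why the lookahead position matters, here is a small Python sketch (again only a stand-in for C#; the sample strings and variable names are illustrative). It compares the tempered token with the lookahead before the dot against a variant with the lookahead after the dot.

```python
import re

tail = "notwork.http://test"

# Lookahead BEFORE the dot: each char is consumed only if "http(s)://"
# does not start at that char, so matching stops right before "http://".
before = re.match(r"(?:(?!https?://).)*", tail)

# Lookahead AFTER the dot: each char is consumed only if "http(s)://" does
# not start right after it, so the "." before "http://" is left out.
after = re.match(r"(?:.(?!https?://))*", tail)

# When the string itself starts with "http://", the second form never
# checks the very first position and consumes the forbidden sequence.
at_start_before = re.match(r"(?:(?!https?://).)*", "http://x")
at_start_after = re.match(r"(?:.(?!https?://))*", "http://x")

print(repr(before.group(0)))
print(repr(after.group(0)))
print(repr(at_start_before.group(0)))
print(repr(at_start_after.group(0)))
```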

Wiktor Stribiżew