For the following regular expression:

var regex = new RegExp("^(www\\.)?[0-9A-Za-z-\\.@:%_\+~#=]+(\\.[a-zA-Z]{2,})+(/.*)?(\\?.*)?");

I don't understand why the string "www.goo gle.com" passes the regex test. When I did this:

var regex = new RegExp("^(www\\.)?[0-9A-Za-z-\\.@:%_\+~#=]+(\\.[a-zA-Z]{2,})+(/.*)?(\\?.*)?$");

i.e. added `$` at the end of the regex string, the above string no longer passes, which is what I want.

I tried finding a "simulator" online to help me figure out how the regex is matching but couldn't find much help.

name_masked
  • What is your question?! – revo Jun 26 '17 at 19:06
  • @PeterOlson: Yes, but why does the regex work on adding the `$` at the end of regex pattern. Shouldn't it still match `gle.com` pattern. – name_masked Jun 26 '17 at 19:09
  • Are you sure? @PeterOlson – revo Jun 26 '17 at 19:10
  • @revo: My question is why regex works on adding `$` at the end of regex pattern. – name_masked Jun 26 '17 at 19:11
  • Doesn't it function the same with/without the `$`? You only require 1 or more `0-9A-Za-z-\\.@:%_\+~#=`, then one or more instance of `\.[a-zA-Z]{2,}`. – chris85 Jun 26 '17 at 19:11
  • In first regular expression you are doing a *partial* match since no exact match is considered. So as soon as a match is found engine is satisfied. In contrast enclosing whole regex with *beginning of input string* and *end of input string* anchors (`^` & `$`) means an exact match which starts from beginning and should finish at the end of input string otherwise it fails. – revo Jun 26 '17 at 19:22
  • @sln: Weird, doesn't complain about invalid range to me. the `... z-\\.@ ..` is taken literally and not as a range. – name_masked Jun 26 '17 at 19:23
  • @name_masked - Yeah it's funny like that, any ambiguity and the `-` is taken literally. I.e. `[a-z-A-Z]`. But, some engines are strict that way and require it to be escaped if ambiguous. Otherwise, if the engine is lame, and has weird internal parsing rules, it might do it's own interpretation, and the result is undefined behavior (like this?). –  Jun 26 '17 at 19:28
  • To the OP, don't use `.*` anywhere in the regex. Use `\S*?` if you don't want whitespace. And, might want to use or modify a more commercial regex for url's `^(?!mailto:)(?:(?:https?|ftp):\/\/)?(?:\S+(?::\S*)?@)?(?:(?:(?:[1-9]\d?|1\d\d|2[01]\d|22[0-3])(?:\.(?:1?\d{1,2}|2[0-4]\d|25[0-5])){2}(?:\.(?:[1-9]\d?|1\d\d|2[0-4]\d|25[0-4]))|(?:(?:[a-z\u00a1-\uffff0-9]+-?)*[a-z\u00a1-\uffff0-9]+)(?:\.(?:[a-z\u00a1-\uffff0-9]+-?)*[a-z\u00a1-\uffff0-9]+)*(?:\.(?:[a-z\u00a1-\uffff]{2,})))|localhost)(?::\d{2,5})?(?:\/[^\s]*)?$` –  Jun 26 '17 at 19:31
  • Put your regex into regex101.com and you'll see why it's working. It matches everything up to the space because the rest is all optional. – Barmar Jun 26 '17 at 19:34
  • @name_masked: [Your regex matches partially](https://regex101.com/r/JtAUev/1) and matches `www.goo` when input is `www.goo gle.com`. So no it is not matching space everywhere but due to missing end anchor it matches partially. – anubhava Jun 26 '17 at 19:35

1 Answer

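`www.goo gle.com` passes the test because the regex matches the prefix `www.goo`: `www` is matched by `[0-9A-Za-z-\.@:%_+~#=]+` and `.goo` is matched by `(\.[a-zA-Z]{2,})+`. The engine first tries `(www\.)?` on `www.`, but then nothing after `goo` starts with a `.`, so it backtracks and lets that optional group match nothing. The last two groups are optional as well, so the regex is satisfied even when they match nothing, and without an end anchor there is no need to match the rest of the string, ` gle.com`.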

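With `$` added, the match has to extend to the end of the input. It can't, since the space is not matched by any of the subexpressions available at that position (`.*` could match it, but only after a `/` or a `?`).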
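
To see this for yourself without an online tool, you can inspect what `exec` returns. The snippet below is only a quick sketch: it reuses the pattern string from the question, and the names `source`, `unanchored`, `anchored`, `input` and `partial` are mine, not from the question.

```javascript
// Pattern string copied from the question; only the trailing "$" differs.
const source = "^(www\\.)?[0-9A-Za-z-\\.@:%_\+~#=]+(\\.[a-zA-Z]{2,})+(/.*)?(\\?.*)?";
const unanchored = new RegExp(source);
const anchored = new RegExp(source + "$");

const input = "www.goo gle.com";

// exec() returns the matched text plus each capture group.
const partial = unanchored.exec(input);
console.log(partial[0]); // "www.goo"  (only a prefix of the input)
console.log(partial[1]); // undefined  ((www\.)? ended up matching nothing)
console.log(partial[2]); // ".goo"     (last iteration of (\.[a-zA-Z]{2,})+)

// test() only reports whether some match exists.
console.log(unanchored.test(input));          // true
console.log(anchored.test(input));            // false
console.log(anchored.test("www.google.com")); // true
```

Looking at `partial[0]` rather than just the boolean from `test` makes it obvious that the unanchored pattern stopped at the space instead of somehow accepting it.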

valiano