This is my RegExp:

const urlReg = /((\w*?)((:\/\/)|www|\w\.{1}\w{2,})[^"<\s]+)(?![^<>]*>|[^"]*?<\/a)/g;

https://regex101.com/r/rET1Le/1

I have already excluded URLs inside tags, so the only remaining issue is the trailing dot in this URL: https://testask.com/item/45e20490-2b86-4b6a-8772-5ed96f64de52. Does anybody know how to modify my RegExp so that it does not match this dot?

CertainPerformance
Michael Lonin
  • So you want it to _not_ catch the dot? – Charles Shiller May 06 '18 at 06:26
  • Please ask a clearer question: show us what your regexp must do: what it should validate, what it should not validate. It could greatly help users to understand your question and solve it faster! – sjahan May 06 '18 at 06:48
  • How about including [`(?=\.?)\b`](https://regex101.com/r/rET1Le/2) in the last part of your regex? Your original regex matches `https://testask.com/item/45e20490-2b86-4b6a-8772-5ed96f64de52.` but the new regex will discard the last dot. – MAGA May 06 '18 at 06:51
  • What if the dot is **supposed** to be part of the URL? It is [valid](https://stackoverflow.com/questions/7555553/can-period-be-part-of-the-path-part-of-an-url) – Nick May 06 '18 at 07:01

1 Answer

If I got you right, the offending case is the second match in your sample, which has a . at the end of the match. With a PCRE regex, we could solve this easily with a negative lookbehind assertion (?<!\.), like this:

((\w*?)((:\/\/)|www|\w\.{1}\w{2,})[^"<\s]+(?<!\.))(?![^<>]*>|[^"]*?<\/a)
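Where lookbehind is available, this pattern can be tried directly from JavaScript. The following is a minimal sketch, assuming an engine with ES2018 lookbehind support (for example a recent Node.js or Chrome); the names `urlRegLookbehind`, `sample`, and `lookbehindMatches` are made up for illustration, and the sample text is shortened from the one in the question.

```javascript
// Assumption: the engine supports ES2018 lookbehind assertions.
// Same pattern as above, written as a JavaScript regex literal with the g flag.
const urlRegLookbehind = /((\w*?)((:\/\/)|www|\w\.{1}\w{2,})[^"<\s]+(?<!\.))(?![^<>]*>|[^"]*?<\/a)/g;

// Shortened sample: the URL is followed by a sentence-ending dot.
const sample = 'You can open your link here: https://testask.com/item/45e20490-2b86-4b6a-8772-5ed96f64de52. More text.';

// With the g flag, String.prototype.match returns all full matches, or null.
const lookbehindMatches = sample.match(urlRegLookbehind) || [];

// The (?<!\.) assertion forces the greedy [^"<\s]+ to give back the final dot.
console.log(lookbehindMatches);
```
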

Unfortunately, lookbehind was not supported by most JavaScript regex engines when this was written (it only became part of the language with ES2018). As an alternative, we can use the (?:(?!avoid).)+ pattern in place of your inner everything-but pattern [^"<\s]+ to exclude the trailing dot; however, it gets a bit messy, since you need several alternatives in the lookahead (sorted here from long to short) to cover a final . before <, " or \s:

((\w*?)((:\/\/)|www|\w\.{1}\w{2,})(?:(?!\.\s|\s|\."|"|\.<|<).)+)(?![^<>]*>|[^"]*?<\/a)

const regex = /((\w*?)((:\/\/)|www|\w\.{1}\w{2,})(?:(?!\.\s|\s|\."|"|\.<|<).)+)(?![^<>]*>|[^"]*?<\/a)/g;
const str = `djfhjkshd fjkshkdjfhsjkdhfjk jdsfh ksjdfksd fkdsf dkfh kjh<br>You can open your link here: https://testask.com/item/45e20490-2b86-4b6a-8772-5ed96f64de52. dsjfklj skldjfklsdjfkl. dsjfjshdfjk skdhfshdfj skdhfjshfjsahfjhasjfh shfk.<br>sdkfhklsdjf kljsdklf kdsljfkljafkljkl .`;
let m;

while ((m = regex.exec(str)) !== null) {
    // This is necessary to avoid infinite loops with zero-width matches
    if (m.index === regex.lastIndex) {
        regex.lastIndex++;
    }
    // m[0] is the full match: the URL without the trailing dot
    console.log(m[0]);
}

The easiest solution, however, is to simply trim the trailing dot afterwards.
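A minimal sketch of that trimming approach, keeping the pattern from the question unchanged; the helper name `findUrls` and the shortened sample text are hypothetical and only serve as illustration.

```javascript
// The original pattern from the question, unchanged.
const urlReg = /((\w*?)((:\/\/)|www|\w\.{1}\w{2,})[^"<\s]+)(?![^<>]*>|[^"]*?<\/a)/g;

// Hypothetical helper: collect all matches, then strip trailing dots from each.
function findUrls(text) {
  // With the g flag, match() returns all full matches, or null if there are none.
  const matches = text.match(urlReg) || [];
  return matches.map((url) => url.replace(/\.+$/, ''));
}

const text = 'Link: https://testask.com/item/45e20490-2b86-4b6a-8772-5ed96f64de52. Done.';
console.log(findUrls(text));
```

Note that this also strips a dot that really belongs to the URL, which, as pointed out in the comments, is a valid case.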

wp78de