How to exclude Regex matches from inside urls with JavaScript

Question

My current regex looks for a searchQuery inside sentences and matches it if the query starts with a blank space and ends with either a blank space or one of ?!,. It generally works well, except for URLs: the regex picks up text inside URLs and messes them up.

For example, if I were looking for "bitcoin" in the sentence "Bitcoin price is going nuts", it would find it, but it would also match inside the following URL, messing it up: https://versionone.vc/the-solar-bitcoin-convergence

How can I tell a JavaScript regex to ignore any match where the character before the matching word is one of / - . _ + ? This would essentially eliminate matches inside URLs.

Current Regex: var reg = new RegExp('(\\b)${searchQuery}(\\s+|\\.|\\,|\\?|\\!', 'gi');

Replacement function: newString = oldString.replace(reg, substringReplacement);

substringReplacement(match) is a function that contains the logic of how to change the matching text.

Alternatively, what's another way to outright ignore urls from the searchable area. Thanks!

Try this: `var reg = new RegExp('(?…` – anubhava Jan 22 '21 at 18:46

score 2 · Answer 1 · answered Jan 22 '21 at 18:50

Modern JavaScript engines support variable-length lookbehind assertions, so you may try:

var reg = new RegExp(`(?<!https?:\/\/\\S*)\\b${searchQuery}[\\s.,?!]`, 'gi');

RegEx Demo

(?<!https?:\/\/\\S*) is a negative lookbehind that fails the match if http:// or https:// followed by 0 or more non-whitespace characters is found immediately before the match position.
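A minimal sketch of the lookbehind approach in use. The helper names `escapeRegExp` and `highlightOutsideUrls` are hypothetical, the `<b>` wrapper merely stands in for the question's `substringReplacement`, and the trailing character class is swapped for a lookahead (an assumption, so that the trailing space or punctuation stays out of the match). It requires an engine with lookbehind support.

```javascript
// Hypothetical helper: escape regex metacharacters in the user's query.
function escapeRegExp(text) {
  return text.replace(/[.*+?^${}()|[\]\\]/g, '\\$&');
}

// Hypothetical helper: wrap every match that is not part of a URL.
function highlightOutsideUrls(text, searchQuery) {
  const reg = new RegExp(
    // Negative lookbehind: no "http(s)://" + non-whitespace run ending here.
    `(?<!https?://\\S*)\\b${escapeRegExp(searchQuery)}(?=\\s|[.,?!]|$)`,
    'gi'
  );
  return text.replace(reg, (match) => `<b>${match}</b>`);
}

const sample =
  'Bitcoin price is going nuts https://versionone.vc/the-solar-bitcoin-convergence';
const result = highlightOutsideUrls(sample, 'bitcoin');
console.log(result);
// <b>Bitcoin</b> price is going nuts https://versionone.vc/the-solar-bitcoin-convergence
```

Escaping the query matters if users may type characters such as `.` or `+`, which would otherwise be interpreted as regex syntax.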

answered Jan 22 '21 at 18:50

anubhava


This seems to work really well in Chrome, but it is not available in other browsers. Even the RegEx Demo site throws an error in Safari for the same code that works in Chrome. Perhaps negative lookbehind isn't available in those browsers. For now, I am going to check for Chrome and use it if available, while trying other solutions as a fallback. – Kirill Jan 22 '21 at 22:27
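Rather than checking for a specific browser, one option is to detect lookbehind support directly. The sketch below is only an illustration: `supportsLookbehind` and `buildSearchRegex` are hypothetical names, and `searchQuery` is assumed to be a plain word with no regex metacharacters.

```javascript
// Returns true when the engine can compile a lookbehind assertion.
function supportsLookbehind() {
  try {
    new RegExp('(?<!a)b');
    return true;
  } catch (err) {
    return false; // engines without lookbehind throw a SyntaxError here
  }
}

// Builds the lookbehind version when possible. Without lookbehind support the
// returned regex does NOT exclude URLs, so a different fallback is still needed.
function buildSearchRegex(searchQuery) {
  const core = `\\b${searchQuery}[\\s.,?!]`;
  const prefix = supportsLookbehind() ? '(?<!https?://\\S*)' : '';
  return new RegExp(prefix + core, 'gi');
}

console.log(supportsLookbehind());
console.log(buildSearchRegex('bitcoin').source);
```

Passing the pattern as a string to `new RegExp` is deliberate: a regex literal containing unsupported syntax would fail when the script is parsed, before the `try` block could catch anything.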

score 1 · Answer 2 · answered Jan 22 '21 at 18:47

I'd match the format of a URL or match the searchQuery pattern, then use a replacer function to check if the URL or the searchQuery was matched. In the case of the URL, replace with the URL (so that nothing gets replaced in such a case).

You'll also need to use backticks for a template literal if you want to use ${}-style interpolation.

// make this as elaborate as you want:
// https://stackoverflow.com/questions/161738/what-is-the-best-regular-expression-to-check-if-a-string-is-a-valid-url
var reg = new RegExp(`(https?:\/\/\\S+)|(\\b)${searchQuery}(\\s+|\\.|\\,|\\?|\\!)`, 'gi');
// g1 is set only when the URL alternative matched; in that case return the match unchanged
newString = oldString.replace(reg, (match, g1) => g1 ? match : substringReplacement(match));

You also need to make sure the () groups are balanced (in your current code, they aren't, so the new RegExp call will currently throw a SyntaxError).

The substringReplacement isn't shown, but unless you're using the groups to replace, you can probably omit the capturing groups entirely, except for the URL section.
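Put together, the idea above might look like the following sketch. The body of `substringReplacement` is a stand-in (the original is not shown), `replaceOutsideUrls` is a hypothetical wrapper, and `searchQuery` is assumed to contain no regex metacharacters. This version needs no lookbehind, so it should also run in engines that lack it.

```javascript
// Stand-in for the question's substringReplacement, whose body is not shown.
function substringReplacement(match) {
  return match.toUpperCase();
}

function replaceOutsideUrls(oldString, searchQuery) {
  // Alternative 1 captures a whole URL; alternative 2 is the search pattern.
  const reg = new RegExp(
    `(https?://\\S+)|\\b${searchQuery}(?:\\s+|[.,?!])`,
    'gi'
  );
  // When the URL group matched, return the match untouched; otherwise transform it.
  return oldString.replace(reg, (match, url) =>
    url ? match : substringReplacement(match)
  );
}

const output = replaceOutsideUrls(
  'Bitcoin price is going nuts, see https://versionone.vc/the-solar-bitcoin-convergence',
  'bitcoin'
);
console.log(output);
// BITCOIN price is going nuts, see https://versionone.vc/the-solar-bitcoin-convergence
```

Because the URL alternative consumes the entire URL in one match, the search pattern never gets a chance to match inside it.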

answered Jan 22 '21 at 18:47

CertainPerformance


I think there are some lessons I can take from here, but I have got to internalize them first. The problem still stands because `match` doesn't know where it came from. So there is no way, that I can see, to tell whether "my-words" came from inside "my words inside a sentence" or from inside a "https://long-stringwhere-my-words-were-used". Also, strangely different behavior between browsers. Will add to the question when I know more. Thank you! – Kirill Jan 23 '21 at 02:56

If you want to make sure it's not one long word, and the word boundaries aren't enough, you can capture space characters or the string boundaries on either side. – CertainPerformance Jan 23 '21 at 03:01
I still haven't managed to make it work, but I just noticed that extra `|` in the regex and realized what you meant. Clever! Basically it says "if you find a url, replace it with itself, otherwise call the replacement function". Yeah, that's what I need! Thanks. Just gotta figure out why it's not working on my end. – Kirill Jan 23 '21 at 18:42
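The suggestion in the comments to capture whitespace or string boundaries on either side could be sketched as follows. `replaceWholeWords` and its `replaceFn` parameter are hypothetical names, and `searchQuery` is assumed to be a plain word. The leading whitespace is captured and put back, since no lookbehind is used.

```javascript
function replaceWholeWords(oldString, searchQuery, replaceFn) {
  // Group 1: start of string or one whitespace character before the word.
  // Group 2: the word itself. The lookahead requires whitespace, one of
  // . , ? ! or the end of the string after the word, without consuming it.
  const reg = new RegExp(`(^|\\s)(${searchQuery})(?=\\s|[.,?!]|$)`, 'gi');
  return oldString.replace(reg, (match, lead, word) => lead + replaceFn(word));
}

const replaced = replaceWholeWords(
  'bitcoin is up, see https://a.io/the-bitcoin-page and buy bitcoin.',
  'bitcoin',
  (word) => word.toUpperCase()
);
console.log(replaced);
// BITCOIN is up, see https://a.io/the-bitcoin-page and buy BITCOIN.
```

One trade-off: a word preceded by a quote or an opening parenthesis is not matched, because only whitespace or the start of the string is accepted before it.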

score 0 · Answer 3 · answered Jan 23 '21 at 05:49

Although the other answers here are more correct as far as regex is concerned, negative lookbehind isn't supported by Safari, so I have for now come up with a workaround. Instead of looking behind and trying to negate the string, I can look ahead and reject matches that are most likely to be part of a URL.

${searchQuery}(?!-|\/|\.com) will skip a big fraction of urls, unless the searchQuery word is the last word in the url.
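A small sketch of this workaround, including the limitation mentioned above. The `mark` helper and the square-bracket wrapper are hypothetical and only make the matches visible.

```javascript
const searchQuery = 'bitcoin';
// Reject a match when it is directly followed by "-", "/" or ".com".
const reg = new RegExp(`\\b${searchQuery}(?!-|/|\\.com)`, 'gi');
const mark = (text) => text.replace(reg, (m) => `[${m}]`);

const skipped = mark(
  'Bitcoin price https://versionone.vc/the-solar-bitcoin-convergence'
);
console.log(skipped);
// [Bitcoin] price https://versionone.vc/the-solar-bitcoin-convergence

const leaked = mark('read https://example.com/bitcoin');
console.log(leaked);
// read https://example.com/[bitcoin]   (last word in the URL still matches)
```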

When I find the perfect answer, I will post it here.
