Why is regex not returning all instances when searching in HTML-content?

Question

I want my regex to match every single instance of a word, which has proven difficult in HTML content with tags. My regex matches the last instance when it comes right after a p-tag, and then skips the other ones. Without the last p-tag everything works as expected.

REGEX:

/(?:^|\s|<(p|strong|b|i|em)(.*?)>)(thewordtomatch)(?:^|\s|\.|\,|\;|\:|\?|\!|\<)/gi

String for matching:

Here is thewordtomatch. thewordtomatch is here too. thewordtomatchthewordtomatch will however only have a full match on this last entry.

The expected output is a match on every instance of "thewordtomatch". It only matches the last one, because of the p-tag. Is there any way to get this to work?

EDIT:

To be clear, the problem is that the match with the p-tag takes precedence over the matches preceded by whitespace, or really removes those matches. If the p-tagged instance of thewordtomatch weren't present, the code would work as expected.

This may be relevant: https://stackoverflow.com/a/1732454/1377002 — Andy, Jul 16 '19 at 08:48
`>)(thewordtomatch)` means that `thewordtomatch` must come immediately after a `>`. `(.*?))(thewordtomatch)` might be what you're looking for, but this is a pretty strange thing to be trying to do - there's almost certainly a more elegant solution — CertainPerformance, Jul 16 '19 at 08:49
@CertainPerformance While `(thewordtomatch)` works in this case, it's not what I'm looking for. I don't want a match on an anchor-tag, div-tag, or any other letter before the word, just when the word is at a line start, after a space, or after one of the tags I've written. — Rhyder, Jul 16 '19 at 08:58
@CertainPerformance this is a very strange duplicate target IMO, I've seen much worse regex questions considered valid. As for the question, it's because `(.*?)>` won't stop at the end of the tag, capturing all of the first `<p>` content. Try with `/(?:^|\s|<(p|strong|b|i|em)([^>]*)>)(thewordtomatch)(?:^|\s|\.|\,|\;|\:|\?|\!|\<)/gi` — Kaddath, Jul 16 '19 at 09:00
Would you want `Here is thewordtomatch.` to match? If not, then stop using regex and search through the document's nodes programmatically instead (if it's not in a document, turn it into one with `DOMParser`) — CertainPerformance, Jul 16 '19 at 09:04
@Kaddath Ah, thanks! That's exactly what it was. If CertainPerformance can remove the duplicate status and you'll answer I'll make it the correct answer. — Rhyder, Jul 16 '19 at 09:07
@Andy, though this is relevant, as Kaddath points out, my HTML is parsed and is being treated as a string. — Rhyder, Jul 16 '19 at 09:08
@Rhyder yes, but if you have a string, that's exactly the situation where you should run a DOM parser on it and get the tag contents from that; most regexes will be tricky to use properly. Also, CertainPerformance's last comment is true too: don't expect your regex to pick up the word only inside your specified tags, it will also match `Here is thewordtomatch.` (because there is a space before the word, so the second alternative will trigger) — Kaddath, Jul 16 '19 at 09:12
@Kaddath In my specific case that's not possible, unfortunately. I have to work with what I've got. The real regex is much bigger and has negative lookaheads as well, but I removed those to explain the problem without making it more difficult than it needed to be. — Rhyder, Jul 16 '19 at 09:18
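
For anyone landing here, the difference between the two patterns discussed in the comments can be reproduced with a rough sketch like the one below. The sample markup is an assumption (the tags in the question's string did not survive), and `fullMatches` is just a hypothetical helper name. The pattern with `(.*?)` starts at the first `<p` and keeps expanding until it finds a `>` that is directly followed by the word, so it swallows the earlier instances into one long match. The `[^>]*` variant cannot cross the tag's own `>`.

```javascript
// Assumed sample input: the tags in the question's string were lost,
// so this markup is a guess at its shape, not the asker's exact data.
const html =
  "<p>Here is thewordtomatch. thewordtomatch is here too.</p> <p>thewordtomatch</p>";

// Pattern from the question: the lazy (.*?) can still run past the first ">".
const lazyPattern =
  /(?:^|\s|<(p|strong|b|i|em)(.*?)>)(thewordtomatch)(?:^|\s|\.|\,|\;|\:|\?|\!|\<)/gi;

// Variant suggested in the comments: [^>]* stops at the tag's closing ">".
const boundedPattern =
  /(?:^|\s|<(p|strong|b|i|em)([^>]*)>)(thewordtomatch)(?:^|\s|\.|\,|\;|\:|\?|\!|\<)/gi;

// Hypothetical helper: collect the full text of every match.
function fullMatches(pattern, input) {
  return Array.from(input.matchAll(pattern), (m) => m[0]);
}

const lazyMatches = fullMatches(lazyPattern, html);
const boundedMatches = fullMatches(boundedPattern, html);

console.log(lazyMatches.length);    // 1 (one long match starting at the first "<p")
console.log(boundedMatches.length); // 3 (one match per instance of the word)
console.log(boundedMatches);
```

Note that both patterns consume the character after the word, so two instances separated by a single space would probably still collide; a lookahead for the trailing delimiter might be worth trying in that case.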
