
I want my regex to match every single instance of a word, which has proven difficult with HTML content that contains tags. When an instance comes directly after a p-tag, my regex matches only that last instance and skips the other ones. Without the last p-tag everything works as expected.

REGEX:

/(?:^|\s|<(p|strong|b|i|em)(.*?)>)(thewordtomatch)(?:^|\s|\.|\,|\;|\:|\?|\!|\<)/gi

String for matching:

<p>Here is thewordtomatch. thewordtomatch is here too. thewordtomatch</p><p>thewordtomatch will however only have a full match on this last entry.</p>

The expected output is a match on every instance of "thewordtomatch". Instead, it only matches the last one, because of the p-tag. Is there any way to get this to work?

EDIT:

To be clear, the problem is that the match with the p-tag takes precedence over the matches preceded by whitespace, or rather removes those matches. If `<p>thewordtomatch` weren't present, the code would work as expected.
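The behaviour described above can be reproduced with a short script. This is only a minimal sketch, assuming a JavaScript runtime that supports `String.prototype.matchAll`; the variable names `html`, `original` and `found` are made up for illustration, while the pattern and the string are the ones from the question.

```javascript
// Sample string and pattern copied from the question.
const html =
  '<p>Here is thewordtomatch. thewordtomatch is here too. thewordtomatch</p>' +
  '<p>thewordtomatch will however only have a full match on this last entry.</p>';

const original =
  /(?:^|\s|<(p|strong|b|i|em)(.*?)>)(thewordtomatch)(?:^|\s|\.|\,|\;|\:|\?|\!|\<)/gi;

// matchAll requires the g flag and yields one entry per match.
const found = [...html.matchAll(original)];

console.log(found.length);   // 1
console.log(found[0].index); // 0
console.log(found[0][0]);    // the text consumed by that single match
```

The single match starts at index 0, at the first `<p`, and runs all the way to the `thewordtomatch` after the second `<p>`, so the earlier occurrences are swallowed inside it rather than matched on their own.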

Rhyder
  • This may be relevant: https://stackoverflow.com/a/1732454/1377002 – Andy Jul 16 '19 at 08:48
  • `>)(thewordtomatch)` means that `thewordtomatch` must come immediately after a `>`. `(.*?))(thewordtomatch)` might be what you're looking for, but this is a pretty strange thing to be trying to do - there's almost certainly a more elegant solution – CertainPerformance Jul 16 '19 at 08:49
  • @CertainPerformance While `(thewordtomatch)` works in this case, it's not what I'm looking for. I don't want a match on an anchor-tag, div-tag, or any other letter before the word, just when the word is at a line start, after a space, or after one of the tags I've written. – Rhyder Jul 16 '19 at 08:58
  • @CertainPerformance this is a very strange duplicate target IMO, I've seen much worse regex questions considered valid. As for the question, it's because `(.*?)>` won't stop at the end of the tag, capturing all of the first `<p>` content, try with `/(?:^|\s|<(p|strong|b|i|em)([^>]*)>)(thewordtomatch)(?:^|\s|\.|\,|\;|\:|\?|\!|\<)/gi` – Kaddath Jul 16 '19 at 09:00
  • Would you want `Here is thewordtomatch.` to match? If not, then stop using regex and search through the document's nodes programmatically instead (if it's not in a document, turn it into one with `DOMParser`) – CertainPerformance Jul 16 '19 at 09:04
  • @Kaddath Ah, thanks! That's exactly what it was. If CertainPerformance can remove the duplicate status and you'll answer I'll make it the correct answer. – Rhyder Jul 16 '19 at 09:07
  • @Andy, though this is relevant, as Kaddath points out, my HTML is parsed and is being treated as a string. – Rhyder Jul 16 '19 at 09:08
  • @Rhyder yes, but if you have a string, that's exactly the situation where you should use a DOM parser on it and get the tag contents from it; most regexes will be tricky to use properly. Also, CertainPerformance's last comment is true too: don't expect your regex to only get the word from your specified tags, it will match `Here is thewordtomatch.` (because there is a space before the word, so the second alternative will trigger) – Kaddath Jul 16 '19 at 09:12
  • @Kaddath In my specific case that's not possible, unfortunately. I have to work with what I've got. The real regex is much bigger and has negative lookaheads as well, but I removed those to explain the problem without making it more difficult than it needed to be. – Rhyder Jul 16 '19 at 09:18
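
The fix discussed in the comments, replacing `(.*?)` with `([^>]*)` so the tag alternative cannot run past the first `>`, can be sketched as follows. This is a minimal sketch rather than a definitive solution: the names `sample`, `fixed` and `hits` are made up for illustration, and the pattern is the question's regex with only that one group changed.

```javascript
// Sample string copied from the question.
const sample =
  '<p>Here is thewordtomatch. thewordtomatch is here too. thewordtomatch</p>' +
  '<p>thewordtomatch will however only have a full match on this last entry.</p>';

// Same pattern as in the question, except that (.*?) is replaced by ([^>]*),
// which can only consume characters inside the opening tag itself.
const fixed =
  /(?:^|\s|<(p|strong|b|i|em)([^>]*)>)(thewordtomatch)(?:^|\s|\.|\,|\;|\:|\?|\!|\<)/gi;

const hits = [...sample.matchAll(fixed)];

console.log(hits.length);            // 4
console.log(hits.map((m) => m[0]));  // each match with its leading and trailing delimiter
console.log(hits.map((m) => m[3]));  // the captured word for each match
```

Note that the trailing group still consumes its delimiter, so two occurrences separated by a single space would share that space and the second one would be missed; turning the trailing group into a lookahead is one way around that.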
