
I'm trying to anonymize an HTML string with a regex, for use in an SQL query.

https://regex101.com/r/QWt1E1/1

(?<!\<)[^<>\s](?!\>)
<p><em>Hi [User</em></p>
<p><em>Tack f&ouml;r visat intresse.</em></p>
<p><em>Good luck!</em><em>&nbsp;</em></p>
<p><em>Sincerely</em></p>
<p><em>nn nnnnn</nm></p>
<p><em>nnnn nnnnnnnn nnnnn nnnnnnnnn</nm></p>
<p><em>nnnn nnnnn</nm><em>nnnnnn</nm></p>
<p><em>nnnnnnnnn</nm></p>

The plan is to replace every non-whitespace character that is not inside angle brackets (<>) with an n. It almost works, but in my example it also replaces the e in </em>. I'm not sure why, or how to fix it.

How can I adjust the regex to not replace the e in the example?

Znarkus

1 Answer


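Use a negative lookahead for [^<>]*> instead of just >. This ensures that the current position is not followed by a > before any other angle bracket, which would indicate that you are currently inside a tag. The original pattern only inspects the characters immediately next to the candidate, so the e in </em> slips through: it is preceded by / (not <) and followed by m (not >).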

This also means that you can drop the lookbehind:

[^<>\s](?![^<>]*>)
          ^^^^^^

https://regex101.com/r/QWt1E1/3
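
To see the difference between the two patterns outside regex101, here is a minimal sketch in Python. Python is only used for illustration, since the question targets SQL; the names `ORIGINAL`, `FIXED`, and `anonymize` are made up for this example.

```python
import re

# Pattern from the question: only looks at the characters directly
# before and after the candidate character.
ORIGINAL = re.compile(r"(?<!\<)[^<>\s](?!\>)")

# Pattern from the answer: skip any character that is followed by a ">"
# before another angle bracket appears, i.e. any character inside a tag.
FIXED = re.compile(r"[^<>\s](?![^<>]*>)")


def anonymize(html, pattern=FIXED):
    """Replace every character matched by `pattern` with an 'n'."""
    return pattern.sub("n", html)


sample = "<p><em>Sincerely</em></p>"
print(anonymize(sample, ORIGINAL))  # <p><em>nnnnnnnnn</nm></p>
print(anonymize(sample, FIXED))     # <p><em>nnnnnnnnn</em></p>
```

Note that the fixed pattern still assumes well-formed markup: an unescaped > in the text content, or a > inside an attribute value, would throw it off. In PostgreSQL the same pattern would presumably be passed to `regexp_replace` with the `g` flag, but that is not tested here.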

Still, it would be better to parse the HTML with an HTML parser, if at all possible.
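
For cases where the string can be processed outside the database, the parser-based approach could look roughly like the sketch below. It uses Python's standard `html.parser` module; the class `TextAnonymizer` and the function `anonymize_html` are hypothetical names for this example. Two design choices here are assumptions, not part of the question: each entity such as `&ouml;` becomes a single n, and comments and doctype declarations are dropped.

```python
import re
from html.parser import HTMLParser


class TextAnonymizer(HTMLParser):
    """Re-emit tags unchanged and replace text content with 'n'."""

    def __init__(self):
        # Keep entities as separate events instead of decoding them.
        super().__init__(convert_charrefs=False)
        self.parts = []

    def handle_starttag(self, tag, attrs):
        # get_starttag_text() returns the raw text of the start tag.
        self.parts.append(self.get_starttag_text())

    def handle_startendtag(self, tag, attrs):
        # Self-closing tags such as <br/> are kept verbatim as well.
        self.parts.append(self.get_starttag_text())

    def handle_endtag(self, tag):
        self.parts.append(f"</{tag}>")

    def handle_data(self, data):
        # Replace every non-whitespace character in text nodes.
        self.parts.append(re.sub(r"\S", "n", data))

    def handle_entityref(self, name):
        # Assumption: a named entity counts as one character.
        self.parts.append("n")

    def handle_charref(self, name):
        # Assumption: a numeric reference counts as one character.
        self.parts.append("n")


def anonymize_html(html):
    parser = TextAnonymizer()
    parser.feed(html)
    parser.close()
    return "".join(parser.parts)


print(anonymize_html("<p><em>Sincerely</em></p>"))
```

Unlike the regex, this never has to guess whether a character is inside a tag, because the parser reports tags and text separately.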

CertainPerformance
  • Amazing, thank you! It needed to be a regex so that I can run it in an SQL query against a PostgreSQL database. – Znarkus May 29 '19 at 08:57