Anonymize HTML with regex

Question

I'm trying to anonymize an HTML string with a regex, for use in an SQL query.

Regex:

(?<!\<)[^<>\s](?!\>)

Input:

<p><em>Hi [User</em></p>
<p><em>Tack f&ouml;r visat intresse.</em></p>
<p><em>Good luck!</em><em>&nbsp;</em></p>
<p><em>Sincerely</em></p>

Output:

<p><em>nn nnnnn</nm></p>
<p><em>nnnn nnnnnnnn nnnnn nnnnnnnnn</nm></p>
<p><em>nnnn nnnnn</nm><em>nnnnnn</nm></p>
<p><em>nnnnnnnnn</nm></p>

The plan was to replace every non-whitespace character that is not within <> with an n. It almost works, but in my example it also replaces the e in </em>. I'm not sure why, or how to fix it.

How can I adjust the regex to not replace the e in the example?

What language are you implementing this in? – CertainPerformance May 29 '19 at 08:55

CertainPerformance · Accepted Answer · 2019-05-29T10:11:36.707

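The original pattern only inspects the characters immediately next to the match, so the e in </em> (preceded by / and followed by m) slips through. Use a negative lookahead for [^<>]*> instead of just >. This ensures that the current position is not followed by a > before any other angle bracket, which would indicate that you're currently inside a tag.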

This also means that you can drop the lookbehind:

[^<>\s](?![^<>]*>)
          ^^^^^^

https://regex101.com/r/QWt1E1/3
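
As a quick way to check the difference between the two patterns, here is a minimal Python sketch. Python's re module is used only for illustration; the names ORIGINAL, FIXED and anonymize are made up for this example. In PostgreSQL the same pattern would presumably go into regexp_replace with the 'g' flag, which is not tested here.

```python
import re

# Pattern from the question: only looks at the immediately adjacent characters.
ORIGINAL = re.compile(r"(?<!<)[^<>\s](?!>)")

# Pattern from the answer: a non-space, non-bracket character that is not
# followed by a ">" before any other angle bracket (so it is not inside a tag).
FIXED = re.compile(r"[^<>\s](?![^<>]*>)")


def anonymize(html: str, pattern: re.Pattern = FIXED) -> str:
    """Replace every character matched by the pattern with the letter n."""
    return pattern.sub("n", html)


sample = "<p><em>Hi [User</em></p>"
print(anonymize(sample, ORIGINAL))  # <p><em>nn nnnnn</nm></p>
print(anonymize(sample))            # <p><em>nn nnnnn</em></p>
```

Note that the lookahead treats any > as the end of a tag, so a literal > inside an attribute value or in the text itself would confuse it.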

Still, it would be better to parse the HTML with an HTML parser, if at all possible.
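
For cases where the processing can happen outside the database, a rough sketch of the parser-based approach might look like this, using Python's standard html.parser module. The class name TextMasker and the function mask_text are hypothetical. This sketch assumes that comments and doctype declarations can be dropped, and it masks each entity such as &nbsp; as a single n, which differs from the regex output.

```python
import re
from html.parser import HTMLParser


class TextMasker(HTMLParser):
    """Re-emits tags and masks the text between them (illustrative sketch)."""

    def __init__(self):
        # convert_charrefs=False reports entities such as &nbsp; as separate events.
        super().__init__(convert_charrefs=False)
        self.parts = []

    def handle_starttag(self, tag, attrs):
        # get_starttag_text() returns the start tag as written in the source.
        self.parts.append(self.get_starttag_text())

    def handle_startendtag(self, tag, attrs):
        # Self-closing tags such as <br/> are also copied through as written.
        self.parts.append(self.get_starttag_text())

    def handle_endtag(self, tag):
        # The parser lowercases tag names, so the original casing is not kept.
        self.parts.append(f"</{tag}>")

    def handle_data(self, data):
        # Mask every non-whitespace character of the text content.
        self.parts.append(re.sub(r"\S", "n", data))

    def handle_entityref(self, name):
        # Named entity, e.g. &ouml; becomes a single n.
        self.parts.append("n")

    def handle_charref(self, name):
        # Numeric entity, e.g. &#246; becomes a single n.
        self.parts.append("n")


def mask_text(html: str) -> str:
    masker = TextMasker()
    masker.feed(html)
    masker.close()
    return "".join(masker.parts)


print(mask_text("<p><em>Hi [User</em></p>"))  # <p><em>nn nnnnn</em></p>
```

Because the parser tracks tag boundaries itself, this approach does not depend on guessing from nearby angle brackets whether a character is inside a tag.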

edited May 29 '19 at 10:11

answered May 29 '19 at 08:54


Amazing, thank you! I needed it to be a regex so that I can run it in an SQL query against a PostgreSQL database. – Znarkus May 29 '19 at 08:57
