101

I'm writing a Chrome extension that involves doing a lot of the following job: sanitizing strings that might contain HTML tags, by converting <, > and & to &lt;, &gt; and &amp;, respectively.

(In other words, the same as PHP's htmlspecialchars(str, ENT_NOQUOTES) – I don't think there's any real need to convert double-quote characters.)

This is the fastest function I have found so far:

function safe_tags(str) {
    return str.replace(/&/g,'&amp;').replace(/</g,'&lt;').replace(/>/g,'&gt;') ;
}

But there's still a big lag when I have to run a few thousand strings through it in one go.

Can anyone improve on this? It's mostly for strings between 10 and 150 characters, if that makes a difference.

(One idea I had was not to bother encoding the greater-than sign – would there be any real danger with that?)

Lightness Races in Orbit
callum
  • 2
    Why? In most cases that you want to do this, you want to insert the data into the DOM, in which case you should forget about escaping it and just make a textNode from it. – Quentin Mar 31 '11 at 11:30
  • 1
    @David Dorward: perhaps he wanted to sanitize POST data, and the server does not round-trip the data correctly. – Lie Ryan Mar 31 '11 at 11:35
  • 4
    @Lie — if so, then the solution is "For Pete's sake, fix the server as you have a big XSS hole" – Quentin Mar 31 '11 at 13:12
  • 2
    @David Dorward: it is possible that he does not have control over the server. I was in such a situation recently, when I was writing a Greasemonkey script to work around a couple of things I don't like about my university's website; I had to do a POST to a server that I do not control and sanitize the POST data using JavaScript (since the raw data comes from a rich text box, and so has heaps of HTML tags which do not round-trip on the server). The web admin was ignoring my requests to fix the website, so I had no other choice. – Lie Ryan Mar 31 '11 at 13:40
  • 1
    I have a use-case where I need to display an error message in a div. The error message can contain HTML and newlines. I want to escape the HTML and replace the newlines with <br>. Then put the result into a div for display.
    – mozey Jul 29 '13 at 09:09

12 Answers

111

Here's one way you can do this:

var escape = document.createElement('textarea');
function escapeHTML(html) {
    escape.textContent = html;
    return escape.innerHTML;
}

function unescapeHTML(html) {
    escape.innerHTML = html;
    return escape.textContent;
}

Here's a demo.

Kevin Reilly
Web_Designer
  • Redesigned the demo. Here's a fullscreen version: http://jsfiddle.net/Daniel_Hug/qPUEX/show/light – Web_Designer May 02 '13 at 15:25
  • 14
    Not sure how/what/why - but this is genius. – rob_james Jun 18 '14 at 12:12
  • @Web_Designer can you explain this magic voodoo? Does it work in all browsers? – degenerate Nov 30 '15 at 20:02
  • 4
    Looks like it is leveraging the TextArea element's existing code for escaping literal text. Very nice, I think this little trick is going to find another home. – Ajax Jan 04 '16 at 08:41
  • escape is now deprecated - see here https://developer.mozilla.org/en-US/docs/Web/JavaScript/Reference/Global_Objects/escape I used encodeURIComponent instead: https://developer.mozilla.org/en-US/docs/Web/JavaScript/Reference/Global_Objects/encodeURIComponent – jazkat Jul 03 '17 at 19:21
  • 3
    @jazkat I'm not using that function. The escape variable I use, I define myself in the example. – Web_Designer Jul 04 '17 at 00:08
  • 2
    but does this lose white space etc. – Andrew Jan 14 '18 at 19:41
  • It may not be useful if you have to show triangular braces as it will add an extra "&" – sagar Aug 20 '19 at 11:23
88

You could try passing a callback function to perform the replacement:

var tagsToReplace = {
    '&': '&amp;',
    '<': '&lt;',
    '>': '&gt;'
};

function replaceTag(tag) {
    return tagsToReplace[tag] || tag;
}

function safe_tags_replace(str) {
    return str.replace(/[&<>]/g, replaceTag);
}

Here is a performance test: http://jsperf.com/encode-html-entities to compare with calling the replace function repeatedly, and using the DOM method proposed by Dmitrij.

Your way seems to be faster...

Why do you need it, though?
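As a quick sanity check, the chained version from the question and the single-pass callback version can be compared on the same input. This is only a sketch: the helper names `escapeChained`, `escapeSinglePass` and `escapeAmpLast` are made up for illustration. It also shows why the chained version has to replace & first, which the single-pass version does not need to worry about.

```javascript
// Chained replaces, as in the question: & must be handled first.
function escapeChained(str) {
    return str.replace(/&/g, '&amp;').replace(/</g, '&lt;').replace(/>/g, '&gt;');
}

// Single pass with a lookup table, as in this answer.
var tagsToReplace = { '&': '&amp;', '<': '&lt;', '>': '&gt;' };
function escapeSinglePass(str) {
    return str.replace(/[&<>]/g, function (tag) {
        return tagsToReplace[tag] || tag;
    });
}

// Hypothetical broken variant: replacing & last re-escapes the & of &lt; and &gt;.
function escapeAmpLast(str) {
    return str.replace(/</g, '&lt;').replace(/>/g, '&gt;').replace(/&/g, '&amp;');
}

var input = '<a href="x">Tom & Jerry</a>';
console.log(escapeChained(input) === escapeSinglePass(input)); // true
console.log(escapeAmpLast('<'));                               // &amp;lt;
```

Both correct versions produce the same output, so any speed difference comes only from how many passes are made over the string.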

Martijn
  • 2
    There is no need to escape `>`. –  Mar 10 '13 at 13:50
  • 7
    Actually if you put the escaped value in an html element's attribute, you need to escape the > symbol. Otherwise it would break the tag for that html element. – Zlatin Zlatev Oct 07 '13 at 15:42
  • 1
    In normal text, characters that need escaping are rare. It's better to call replace only when needed, if you care about max speed: `if (/[<>&"]/.test(str)) { ... }` – Vitaly Oct 26 '14 at 04:22
  • @LightnessRacesinOrbit can you give an example where something could go wrong if you didn't escape `>`? – callum Jul 20 '15 at 15:24
  • 4
    @callum: No. I am not interested in enumerating cases in which I think "something could go wrong" (not least because it's the unexpected/forgotten cases that'll hurt you, and when you least expect it at that). I am interested in coding to standards (so the unexpected/forgotten cases can't hurt you _by definition_). I can't stress how important this is. `>` is a special character in HTML, so escape it. Simple as that. :) – Lightness Races in Orbit Jul 20 '15 at 15:30
  • @LightnessRacesinOrbit I stick to standards as a default (for obvious maintenance reasons) but blindly following them to the letter 100% of the time is an opportunity cost. There are things that are nonstandard that are awesome. Not saying this one is a strong example. But dogma isn't always best in programming. You're interested in writing spec-compliant code, I'm interested in making real web browsers do real things. It's legitimate for me to ask you to back up your claim with something more than "because standards". – callum Jul 20 '15 at 17:18
  • @callum: No, it's not, really. Your argument against following standards blindly is all well and good, and I support it, but you should follow standards _until you have a good reason not to_, not the other way around. Hence your question is vacuous. You should explain to me an example of where `>` must not be escaped, then explain how it's relevant to the OP here. – Lightness Races in Orbit Jul 20 '15 at 17:27
  • 4
    @LightnessRacesinOrbit It's relevant because the question is what is the fastest possible method. If it's possible to skip the `>` replacement, that would make it faster. – callum Jul 20 '15 at 17:37
  • 1
    @callum: Now you're talking. :) – Lightness Races in Orbit Jul 20 '15 at 17:41
  • How to escape '"'? Am trying this '"': '\"' , doesn't work – nickalchemist Sep 23 '15 at 06:47
  • @nodeninja: a comment is not the place to ask a new question. You should ask that in a new question. – Martijn Sep 24 '15 at 11:39
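The test-before-replace idea from the comments above can be sketched roughly as follows. The name `escapeIfNeeded` is hypothetical, and whether the extra `test` call pays off depends on how many of the strings actually contain special characters, so it is worth measuring on real data.

```javascript
// No g flag on the test regex, so test() keeps no lastIndex state between calls.
var NEEDS_ESCAPE = /[&<>]/;
var tagsToReplace = { '&': '&amp;', '<': '&lt;', '>': '&gt;' };

function escapeIfNeeded(str) {
    // Fast path: most strings contain nothing to escape and are returned as-is.
    if (!NEEDS_ESCAPE.test(str)) {
        return str;
    }
    return str.replace(/[&<>]/g, function (tag) {
        return tagsToReplace[tag] || tag;
    });
}

console.log(escapeIfNeeded('plain text')); // plain text
console.log(escapeIfNeeded('<a & b>'));    // &lt;a &amp; b&gt;
```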
32

Martijn's method as a prototype function:

String.prototype.escape = function() {
    var tagsToReplace = {
        '&': '&amp;',
        '<': '&lt;',
        '>': '&gt;'
    };
    return this.replace(/[&<>]/g, function(tag) {
        return tagsToReplace[tag] || tag;
    });
};

var a = "<abc>";
var b = a.escape(); // "&lt;abc&gt;"
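A variant that avoids adding a string-named method to `String.prototype` is sketched below, assuming the environment supports `Symbol`. The symbol name `escapeHtml` is an illustrative choice, and the lookup table is hoisted so it is built once instead of on every call.

```javascript
// Lookup table created once, not on every call.
var tagsToReplace = {
    '&': '&amp;',
    '<': '&lt;',
    '>': '&gt;'
};

// Hypothetical symbol key: it cannot collide with string-named properties.
var escapeHtml = Symbol('escapeHtml');

// defineProperty keeps the method non-enumerable.
Object.defineProperty(String.prototype, escapeHtml, {
    value: function () {
        return this.replace(/[&<>]/g, function (tag) {
            return tagsToReplace[tag] || tag;
        });
    }
});

var a = '<abc>';
var b = a[escapeHtml](); // "&lt;abc&gt;"
console.log(b);
```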
Aram Kocharyan
  • 12
    If added to `String` like this, it should be **escapeHtml**, since it's not escaping for a String in general. That is, `String.escapeHtml` is correct, but `String.escape` raises the question, "escape for what?" – Lawrence Dol Mar 13 '14 at 03:12
  • 3
    Yeah good idea. I've moved away from extending the prototype these days to avoid conflicts. – Aram Kocharyan Mar 13 '14 at 23:34
  • 1
    If your browser has support for Symbol, you could use that instead to avoid polluting the string-key namespace. var escape = Symbol("escape"); String.prototype[escape] = function(){ ... }; "text"[escape](); – Ajax Jan 04 '16 at 08:58
  • plus one for the example. – Timo Sep 30 '20 at 18:12
14

The fastest method is:

function escapeHTML(html) {
    return document.createElement('div').appendChild(document.createTextNode(html)).parentNode.innerHTML;
}

This method is about twice as fast as the methods based on 'replace'; see http://jsperf.com/htmlencoderegex/35.

Source: https://stackoverflow.com/a/17546215/698168

Community
Julien Kronegg
14

An even quicker/shorter solution is:

escaped = new Option(html).innerHTML

This is related to some weird vestige of JavaScript whereby the Option element retains a constructor that does this sort of escaping automatically.

Credit to https://github.com/jasonmoo/t.js/blob/master/t.js

Todd
  • 1
    Neat one-liner but the [slowest method](https://jsperf.com/htmlentityencode/1) after regex. Also, the text here can have whitespace stripped, according to the [spec](https://www.w3.org/TR/2012/WD-html5-20121025/the-option-element.html#dom-option) – ShortFuse Jan 06 '20 at 19:25
  • Note that @ShortFuse's "slowest method" link makes my system run out of RAM (with ~6GB free) and firefox seems to stop allocating just before it's out of memory so instead of killing the offending process, linux will sit there and let you do a hard power off. – Luc Jul 11 '20 at 09:09
11

The AngularJS source code also has a version inside of angular-sanitize.js.

var SURROGATE_PAIR_REGEXP = /[\uD800-\uDBFF][\uDC00-\uDFFF]/g,
    // Match everything outside of normal chars and " (quote character)
    NON_ALPHANUMERIC_REGEXP = /([^\#-~| |!])/g;
/**
 * Escapes all potentially dangerous characters, so that the
 * resulting string can be safely inserted into attribute or
 * element text.
 * @param value
 * @returns {string} escaped text
 */
function encodeEntities(value) {
  return value.
    replace(/&/g, '&amp;').
    replace(SURROGATE_PAIR_REGEXP, function(value) {
      var hi = value.charCodeAt(0);
      var low = value.charCodeAt(1);
      return '&#' + (((hi - 0xD800) * 0x400) + (low - 0xDC00) + 0x10000) + ';';
    }).
    replace(NON_ALPHANUMERIC_REGEXP, function(value) {
      return '&#' + value.charCodeAt(0) + ';';
    }).
    replace(/</g, '&lt;').
    replace(/>/g, '&gt;');
}
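The surrogate-pair arithmetic in the snippet above can be checked by hand on one character. This is just a worked example: U+1F600 is an arbitrary choice, and `codePointAt` is used only to cross-check the result.

```javascript
var pair = '\uD83D\uDE00';    // U+1F600 stored as a UTF-16 surrogate pair
var hi = pair.charCodeAt(0);  // 0xD83D (high surrogate)
var low = pair.charCodeAt(1); // 0xDE00 (low surrogate)

// Same formula as in encodeEntities:
// (0xD83D - 0xD800) * 0x400 = 0xF400, (0xDE00 - 0xDC00) = 0x200,
// 0xF400 + 0x200 + 0x10000 = 0x1F600
var codePoint = ((hi - 0xD800) * 0x400) + (low - 0xDC00) + 0x10000;

console.log(codePoint === 0x1F600);             // true
console.log(codePoint === pair.codePointAt(0)); // true
console.log('&#' + codePoint + ';');            // &#128512;
```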
Kevin Hakanson
  • 1
    Wow, that non-alphanum regex is intense. I don't think the | in the expression is needed though. – Ajax Jan 04 '16 at 09:14
9

All-in-one script:

// HTML entities Encode/Decode

function htmlspecialchars(str) {
    var map = {
        "&": "&amp;",
        "<": "&lt;",
        ">": "&gt;",
        "\"": "&quot;",
        "'": "&#39;" // ' -> &apos; for XML only
    };
    return str.replace(/[&<>"']/g, function(m) { return map[m]; });
}
function htmlspecialchars_decode(str) {
    var map = {
        "&amp;": "&",
        "&lt;": "<",
        "&gt;": ">",
        "&quot;": "\"",
        "&#39;": "'"
    };
    return str.replace(/(&amp;|&lt;|&gt;|&quot;|&#39;)/g, function(m) { return map[m]; });
}
function htmlentities(str) {
    var textarea = document.createElement("textarea");
    textarea.innerHTML = str;
    return textarea.innerHTML;
}
function htmlentities_decode(str) {
    var textarea = document.createElement("textarea");
    textarea.innerHTML = str;
    return textarea.value;
}

http://pastebin.com/JGCVs0Ts
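One detail worth noting about htmlspecialchars_decode above is that it decodes in a single pass. A chained decoder that handles &amp; first would decode some input twice. The sketch below illustrates the difference; the function names `decodeSinglePass` and `decodeChainedAmpFirst` are made up for this comparison.

```javascript
var decodeMap = {
    '&amp;': '&',
    '&lt;': '<',
    '&gt;': '>',
    '&quot;': '"',
    '&#39;': "'"
};

// Single pass, same approach as htmlspecialchars_decode.
function decodeSinglePass(str) {
    return str.replace(/&amp;|&lt;|&gt;|&quot;|&#39;/g, function (m) {
        return decodeMap[m];
    });
}

// Hypothetical chained variant: the output of the first replace is
// scanned again by the later ones, so "&amp;lt;" is decoded twice.
function decodeChainedAmpFirst(str) {
    return str.replace(/&amp;/g, '&').replace(/&lt;/g, '<').replace(/&gt;/g, '>');
}

// "&amp;lt;" is the escaped form of the literal text "&lt;".
console.log(decodeSinglePass('&amp;lt;'));      // &lt;
console.log(decodeChainedAmpFirst('&amp;lt;')); // <
```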

baptx
  • I didn't downvote, but all regex style replace will fail to encode unicode... So, anyone using a foreign language is going to be disappointed. The – Ajax Jan 04 '16 at 08:59
  • The regex works fine for me with a number of non-Latin Unicode characters. I wouldn't expect anything else. How do you think this wouldn't work? Are you thinking of single-byte codepages that require HTML entities? That's what the 3rd and 4th function are for, and explicitly not the 1st and second. I like the differentiation. – ygoe Feb 29 '16 at 17:30
  • @LonelyPixel I don't think he will see your comment if you don't mention him ("Only one additional user can be notified; the post owner will always be notified") – baptx Feb 29 '16 at 19:31
  • I didn't know targeted notifications exist at all. @Ajax please see my comment above. – ygoe Mar 01 '16 at 08:00
  • @LonelyPixel I see now. For some reason I didn't think there was a textarea style replacement in this answer. I was, indeed, thinking of double codepoint big unicode values, like Mandarin. I mean, it would be possible to make a regex smart enough, but when you look at the shortcuts that browser vendors can take, I would feel pretty good betting that textarea will be much faster (than a completely competent regex). Did someone post a benchmark on this answer? I swore I had seen one. – Ajax Mar 02 '16 at 02:41
3

function encode(r) {
  return r.replace(/[\x26\x0A\x3c\x3e\x22\x27]/g, function(r) {
    return "&#" + r.charCodeAt(0) + ";";
  });
}

test.value=encode('How to encode\nonly html tags &<>\'" nice & fast!');

/*
 \x26 is & (ampersand; order inside a character class does not matter),
 \x0A is newline,
 \x22 is ",
 \x27 is ',
 \x3c is <,
 \x3e is >
*/
<textarea id=test rows=11 cols=55>www.WHAK.com</textarea>
Dave Brown
1

Martijn's method as a single function that also handles the " character (for use in JavaScript):

function escapeHTML(html) {
    var fn=function(tag) {
        var charsToReplace = {
            '&': '&amp;',
            '<': '&lt;',
            '>': '&gt;',
            '"': '&#34;'
        };
        return charsToReplace[tag] || tag;
    }
    return html.replace(/[&<>"]/g, fn);
}
iman
  • this solution I have also found in Vue framework https://github.com/vuejs/vue/blob/b51430f598b354ed60851bb62885539bd25de3d8/src/platforms/web/server/util.js#L50 – Luckylooke Feb 16 '21 at 18:12
1

I'm not entirely sure about speed, but if you are looking for simplicity I would suggest using the lodash/underscore escape function.

gilmatic
0

I'll add XMLSerializer to the pile. It provides the fastest result without using any object caching (not on the serializer, nor on the Text node).

function serializeTextNode(text) {
  return new XMLSerializer().serializeToString(document.createTextNode(text));
}

The added bonus is that it supports attributes, which are serialized differently from text nodes:

function serializeAttributeValue(value) {
  const attr = document.createAttribute('a');
  attr.value = value;
  return new XMLSerializer().serializeToString(attr);
}

You can see what it's actually replacing by checking the spec, both for text nodes and for attribute values. The full documentation has more node types, but the concept is the same.

As for performance, it's the fastest when not cached. When you do allow caching, calling innerHTML on an HTMLElement with a child Text node is fastest. Regex would be slowest (as shown in other comments). Of course, XMLSerializer could be faster in other browsers, but in my (limited) testing, innerHTML is fastest.


Fastest single line:

new XMLSerializer().serializeToString(document.createTextNode(text));

Fastest with caching:

const cachedElementParent = document.createElement('div');
const cachedChildTextNode = document.createTextNode('');
cachedElementParent.appendChild(cachedChildTextNode);

function serializeTextNode(text) {
  cachedChildTextNode.nodeValue = text;
  return cachedElementParent.innerHTML;
}

https://jsperf.com/htmlentityencode/1

ShortFuse
-2

A bit late to the show, but what's wrong with using encodeURIComponent() and decodeURIComponent()?
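For comparison, here is a small sketch of what encodeURIComponent() produces next to HTML entity escaping. The sample input is arbitrary. The output is percent-encoded for use in URLs, so it would need decodeURIComponent() before display and is not a drop-in replacement for entity escaping.

```javascript
var input = '<b>&';

// Percent-encoding for URLs: every special character becomes %XX.
var uriEncoded = encodeURIComponent(input);
console.log(uriEncoded); // %3Cb%3E%26

// Decoding returns the original, still unescaped, markup.
console.log(decodeURIComponent(uriEncoded) === input); // true

// HTML entity escaping of the same input, as asked for in the question.
var htmlEscaped = input.replace(/&/g, '&amp;').replace(/</g, '&lt;').replace(/>/g, '&gt;');
console.log(htmlEscaped); // &lt;b&gt;&amp;
```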

suncat100
  • 1
    Those do something completely unrelated – callum Apr 04 '18 at 16:22
  • 2
    Perhaps the biggest abuse of the word "completely" I have ever heard. For example, in relation to the main topic question, it could be used to decode a html string (obviously for some kinda storage reason), regardless of html tags, and then easily encode it back to html again when and if required. – suncat100 Apr 05 '18 at 17:27