How can I strip the HTML from a string in JavaScript?
4 Answers
cleanText = strInputCode.replace(/<\/?[^>]+(>|$)/g, "");
Distilled from this website (web.archive).
This regex looks for `<`, an optional slash `/`, one or more characters that are not `>`, then either `>` or `$` (the end of the line).
Examples:
'<div>Hello</div>' ==> 'Hello'
 ^^^^^     ^^^^^^
'Unterminated Tag <b' ==> 'Unterminated Tag '
                  ^^
But it is not bulletproof:
'If you are < 13 you cannot register' ==> 'If you are '
            ^^^^^^^^^^^^^^^^^^^^^^^^
'<div data="score > 42">Hello</div>' ==> ' 42">Hello'
 ^^^^^^^^^^^^^^^^^^          ^^^^^^
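To try these cases yourself, here is a small sketch that wraps the regex in a function. The name `stripTagsNaive` is made up for illustration; the regex itself is the one from this answer.

```javascript
// Hypothetical helper name; the regex is the one from the answer above.
function stripTagsNaive(input) {
  return input.replace(/<\/?[^>]+(>|$)/g, "");
}

// Well-formed and unterminated tags are removed as intended.
console.log(JSON.stringify(stripTagsNaive("<div>Hello</div>")));    // "Hello"
console.log(JSON.stringify(stripTagsNaive("Unterminated Tag <b"))); // "Unterminated Tag "

// A bare "<" in text, or a ">" inside an attribute value, confuses the regex.
console.log(JSON.stringify(stripTagsNaive("If you are < 13 you cannot register"))); // "If you are "
console.log(JSON.stringify(stripTagsNaive('<div data="score > 42">Hello</div>')));  // " 42\">Hello"
```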
If someone is trying to break your application, this regex will not protect you. It should only be used if you already know the format of your input. As other knowledgeable and mostly sane people have pointed out, to safely strip tags, you must use a parser.
If you do not have access to a convenient parser like the DOM, and you cannot trust your input to be in the right format, you may be better off using a package like sanitize-html; other sanitizers are also available.
- Sorry, but that would break `` – f.ardelian Feb 15 '11 at 10:56
- @f.ardelian people who make a hobby out of breaking the ill-use of regular expressions for parsing general HTML are great. It is a great hobby. – Ziggy May 07 '13 at 18:39
- @Ziggy: That sounds an awful lot like sarcasm... – f.ardelian May 07 '13 at 22:31
- @f.ardelian no! Really! Every time I read one of these comment threads I get a little thrill. "Ho ho ho," I think "b\" src=\"a_b.gif\" />, so clever!" – Ziggy May 08 '13 at 05:28
- @f.ardelian That would be buggy html, it had to be – peterh Jan 26 '15 at 11:55
- Not in HTML5, the syntax can be different. @peterh – seanlevan Mar 14 '15 at 19:16
- Using regex is not a good approach: http://stackoverflow.com/questions/1732348/regex-match-open-tags-except-xhtml-self-contained-tags – Sara Jun 01 '16 at 17:39
- Could we improve it with `|$)` ? – BM- Jul 17 '17 at 07:30
- This code will remove `<18` as well, while `<18` is not a tag, it is just a string – Mher Aghabalyan Apr 01 '20 at 22:44
- Could somebody please explain what this regex eliminates and what it keeps? It works great for my needs, so I need to understand it. – newdeveloper Jul 30 '20 at 17:41
- @newdeveloper explained what this regex does in the answer body. – ReactiveRaven Aug 05 '20 at 16:34
Using the browser's parser is probably the best bet in current browsers. The following will work, with the following caveats:
- Your HTML is valid within a `<div>` element. HTML contained within `<body>` or `<html>` or `<head>` tags is not valid within a `<div>` and may therefore not be parsed correctly.
- `textContent` (the DOM standard property) and `innerText` (non-standard) properties are not identical. For example, `textContent` will include text within a `<script>` element while `innerText` will not (in most browsers). This only affects IE <=8, which is the only major browser not to support `textContent`.
- The HTML does not contain `<script>` elements.
- The HTML is not `null`
- The HTML comes from a trusted source. Using this with arbitrary HTML allows arbitrary untrusted JavaScript to be executed. This example is from a comment by Mike Samuel on the duplicate question:
<img onerror='alert(\"could run arbitrary JS here\")' src=bogus>
Code:
var html = "<p>Some HTML</p>";
var div = document.createElement("div");
div.innerHTML = html;
var text = div.textContent || div.innerText || "";
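For input you do not fully trust, one variation worth considering is `DOMParser` instead of `innerHTML` on a live element. As far as I know, a document produced by `parseFromString` is inert, so scripts and handlers like the `onerror` above should not run while parsing, but treat that as an assumption to verify in your target browsers. This is only a sketch: the function name `htmlToText` and the optional `parser` argument (there so the function can be exercised outside a browser) are hypothetical.

```javascript
// Sketch only. Assumes a browser environment where DOMParser exists.
// `htmlToText` and the injectable `parser` argument are hypothetical names.
function htmlToText(html, parser) {
  // Guard against null/undefined input (one of the caveats listed above).
  if (html == null) {
    return "";
  }
  // Create the parser lazily so the null check above needs no DOM at all.
  var domParser = parser || new DOMParser();
  // Parse into a separate document rather than an element of the live page.
  var doc = domParser.parseFromString(String(html), "text/html");
  // Note: textContent still includes the text inside <script> elements.
  return doc.body.textContent || "";
}

// Browser usage:
//   htmlToText("<p>Some HTML</p>");
```

The returned string is plain text, but it is still untrusted data: if you later insert it into a page as HTML, it needs escaping like any other user input.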
- Nice answer, I didn't know about `textContent`. How many browsers do `textContent` + `innerText` cover? BTW, I've edited my answer to include *the jQuery way*. – Felix Feb 15 '11 at 11:04
- @Felix: All major browsers have at least one of `textContent` and `innerText`. – Tim Down Feb 15 '11 at 11:05
- Doesn't work when the string contains something like . Then it crashes with "illegal token at" etc. – Till Aug 19 '12 at 01:04
- Good caveats. In case it is not already clear I wanted to add that Firefox will crash on `div.innerHTML = html` if the value of `html` is `NULL`. Worse, it won't properly report the error (instead says parent function has `TypeError`). Chrome/IE do not crash. – Ryan Rapp Jan 24 '13 at 22:20
- SECURITY ISSUE ... This could be vulnerable as you're setting div.innerHTML ... I'm sure you don't want to get some unwanted script executed. ... manual cleanup would be cool. – Khizer Ali Aug 09 '16 at 11:44
- Elegant solution, but it isn't universal. It doesn't work if you use it on a Node server because of the document dependency – Harijoe Apr 14 '17 at 10:17
- Literally it is not. But having contents of two paragraphs becoming one word makes no sense. – eomeroff Jan 20 '20 at 10:47
- @eomeroff: Whether it makes sense depends on the context. What input do you want to accept and what do you require the output to be? – Tim Down Jan 20 '20 at 15:05
- @TimDown the one that closely corresponds to what is rendered with two tags. For example, you take content from a rich text editor, remove the tags and paste it to notepad. Should have the same spaces or/and line breaks. – eomeroff Jan 20 '20 at 15:25
var html = "<p>Hello, <b>World</b>";
var div = document.createElement("div");
div.innerHTML = html;
alert(div.innerText); // Hello, World
That's pretty much the best way of doing it, you're letting the browser do what it does best -- parse HTML.
Edit: As noted in the comments below, this is not the most cross-browser solution. The most cross-browser solution would be to recursively go through all the children of the element and concatenate all text nodes that you find. However, if you're using jQuery, it already does it for you:
alert($("<p>Hello, <b>World</b></p>").text());
Check out the text method.
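If you are not using jQuery, the recursive approach mentioned in the edit (walk all the children and concatenate the text nodes) could be sketched roughly like this. The function name `collectText` is made up for illustration, and it only relies on the standard `nodeType`, `nodeValue` and `childNodes` properties.

```javascript
// Sketch of the recursive text-node walk; `collectText` is a hypothetical name.
function collectText(node) {
  var TEXT_NODE = 3; // same value as Node.TEXT_NODE in the DOM
  if (node.nodeType === TEXT_NODE) {
    return node.nodeValue;
  }
  var text = "";
  var children = node.childNodes || [];
  // Recurse into every child and concatenate whatever text is found.
  // Comment nodes have no children, so they contribute nothing.
  for (var i = 0; i < children.length; i++) {
    text += collectText(children[i]);
  }
  return text;
}

// Browser usage:
//   var div = document.createElement("div");
//   div.innerHTML = "<p>Hello, <b>World</b>";
//   collectText(div); // "Hello, World"
```

Because it uses an index loop instead of iterating the `NodeList` directly, it should also work in older browsers.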
- A concise jQuery could look like: `var html = "test"; var text = $("").html(html).text();` Using `$("")` lets you reuse the same element and less memory for consecutive calls or for loops. – Sukima Jan 04 '12 at 21:14
- And check out the text method for `var txt = "my line my other line some other text"; alert($(txt).text());` where you don't proxy the string within a DOM node. 3 lines in, 2 lines out. – frumbert Oct 17 '12 at 02:49
- I like the jQuery solution because it is not vulnerable to code injection, as far as I know. – mareoraft Jan 21 '17 at 16:07
- JS injection is possible if the HTML text is coming from unknown sources. – FleMo Aug 05 '20 at 09:37
- The jQuery solution is the best for all of us (most of us, I guess) who already use it almost everywhere. Just keep in mind that if the string is in a variable, you will have to insert it in an element, e.g. `let text = $(\`${html_fragment}\`)`. – Francesco Marchetti-Stasi Dec 07 '20 at 16:21
I know this question has an accepted answer, but I feel that it doesn't work in all cases.
For completeness and since I spent too much time on this, here is what we did: we ended up using a function from php.js (which is a pretty nice library for those more familiar with PHP but also doing a little JavaScript every now and then):
http://phpjs.org/functions/strip_tags:535
It seemed to be the only piece of JavaScript code which successfully dealt with all the different kinds of input I stuffed into my application. That is, without breaking it – see my comments about the `<script />` tag above.
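For readers who cannot reach that link: the following is not the php.js source, just a rough sketch of how a `strip_tags`-style function with an allow-list might work without a DOM. The name `stripTagsSketch`, both regexes, and the `"<b><i>"` format of the `allowed` argument are my own assumptions.

```javascript
// Rough sketch only; NOT the php.js implementation.
// `stripTagsSketch` and its regexes are assumptions made for illustration.
function stripTagsSketch(input, allowed) {
  // Turn an allow-list such as "<b><i>" into a lookup table of tag names.
  var allowedTags = {};
  var listed = String(allowed || "").toLowerCase().match(/<[a-z][a-z0-9]*>/g) || [];
  for (var i = 0; i < listed.length; i++) {
    allowedTags[listed[i]] = true;
  }
  var comments = /<!--[\s\S]*?-->/g;
  // A tag must start with a letter, so a bare "<" followed by a space survives.
  var tags = /<\/?([a-z][a-z0-9]*)\b[^>]*>/gi;
  return String(input)
    .replace(comments, "")
    .replace(tags, function (match, name) {
      // Keep the whole tag (attributes included) only if its name is allowed.
      return allowedTags["<" + name.toLowerCase() + ">"] ? match : "";
    });
}

console.log(stripTagsSketch("<p>Hello <b>World</b></p>"));        // Hello World
console.log(stripTagsSketch("<p>Hello <b>World</b></p>", "<b>")); // Hello <b>World</b>
```

Note that an allowed tag is kept with all of its attributes, so something like `onclick` passes straight through. A regex approach like this is therefore not a sanitizer for untrusted input.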
- ^ this, definitely better than the accepted answer for Chrome 30.0 and above – ebt Oct 04 '13 at 20:03
- Works nicely on server-side without DOM support, e.g. Google Apps Script. – Mogsdad Dec 03 '14 at 16:08
- If you use the allowed param you are vulnerable to XSS: `stripTags('mytext', '')` returns `mytext` – Chris Cinelli Feb 20 '16 at 01:23