Remove all specific html tags using gsub R

Question

I have a string like

txt<-"text text text <div><div><script>xxxx</script></div><script>yyyyy</script>text </div><script>zzzzzz</script>"

I want to delete all script tags and their content, leaving:

"text text text <div><div></div>text </div>"

I have tried:

gsub("<script.*?>(.*)<\\/script>", "", txt)

Could you also recommend a good tutorial for learning regular expressions in R quickly?

Thanks in advance

So, I think your question is a duplicate of http://stackoverflow.com/questions/22937618/reference-what-does-this-regex-mean/22944075?s=1|2.9568#22944075 — jogo, Nov 28 '15 at 11:19
I don't think you should be using regex at all on html. I would recommend `removeNodes()` from the XML package. — Rich Scriven, Nov 28 '15 at 21:40

score 2 · Accepted Answer · answered Nov 28 '15 at 13:55

Your first try, with the greedy (.*), was doomed to fail: greedy matching extends to the last </script> in the string, so it also swallows the text between separate script blocks that you need to keep. (Lazy matching does not guarantee the shortest match either.)
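
A minimal sketch of what the greedy version actually does, assuming the input string with every tag spelled <script>. The perl = TRUE argument is an assumption added here so that the lazy .*? is handled by PCRE; it is not part of the original attempt.

```r
txt <- "text text text <div><div><script>xxxx</script></div><script>yyyyy</script>text </div><script>zzzzzz</script>"

# (.*) is greedy: after the first opening tag it runs to the end of the
# string and backtracks only as far as the LAST </script>.
greedy <- gsub("<script.*?>(.*)<\\/script>", "", txt, perl = TRUE)
print(greedy)
## [1] "text text text <div><div>"
```

Everything from the first opening tag to the last closing tag is removed, including the closing </div> tags and the text that should have survived.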

The latest attempt, <script>[^</script>^<script>]+</script>, is not valid either, because [^</script>^<script>]+ is a negated character class: it matches 1 or more characters other than <, /, s, c, r, i, p, t, >, ^. Clearly that is not what you need.
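
A quick check makes the character-class problem visible. This is only a sketch: the two input strings are made up for illustration, and print(1) as a script body is hypothetical.

```r
pattern <- "<script>[^</script>^<script>]+</script>"

# The body "xxxx" contains none of the excluded characters, so this matches.
ok <- gsub(pattern, "", "a<script>xxxx</script>b")

# "print(1)" starts with p, which the class excludes, so nothing matches
# and the script block is left in place.
bad <- gsub(pattern, "", "a<script>print(1)</script>b")

print(ok)
## [1] "ab"
print(bad)
## [1] "a<script>print(1)</script>b"
```

The pattern appears to work on the sample string only because x, y and z happen not to be in the excluded set.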

Leaving the HTML question itself aside, regex can be used to process any plain-text file and remove large chunks from it.

When we need to match a substring between some non-identical markers (or delimiters), we may use an unroll-the-loop technique with a Perl-like regex that supports lookaheads.

Here is working code that should handle plain text of any size:

txt<-"text text text <div><div><script>xxxx</script></div><script>yyyyy</script>text </div><script>zzzzzz</script>"
gsub("<script\\b[^<]*>[^<]*(?:<(?!/script>)[^<]*)*</script>", "", txt, perl=TRUE)
## [1] "text text text <div><div></div>text </div>"

The regex demo can be seen here and here is the IDEONE demo.

Basically, that matches:

<script\\b[^<]*> - any opening <script> tag, even one with attributes (note that < is not expected to appear in HTML attributes, so [^<]* is safer to use than [^<>]* or [^>]*)
[^<]*(?:<(?!/script>)[^<]*)* - the unrolled equivalent of the (?s).*? construct: it matches any text up to, but not including, the first </script>
</script> - closing </script> tag
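
The three parts above can be assembled piece by piece. This is a sketch only: the sample inputs (a type attribute, a < inside the script body, a multi-line body) are hypothetical and chosen to exercise each part.

```r
pattern <- paste0(
  "<script\\b[^<]*>",              # opening tag, with or without attributes
  "[^<]*(?:<(?!/script>)[^<]*)*",  # body: any text up to the first </script>
  "</script>"                      # closing tag
)

with_attr <- '<p>keep</p><script type="text/javascript">if (a < b) { run(); }</script><p>this</p>'
multiline <- "a<script>\nline1\nline2\n</script>b"

res_attr  <- gsub(pattern, "", with_attr, perl = TRUE)
res_multi <- gsub(pattern, "", multiline, perl = TRUE)

print(res_attr)
## [1] "<p>keep</p><p>this</p>"
print(res_multi)
## [1] "ab"
```

Because the body is matched with negated character classes rather than a dot, line breaks inside the script need no (?s) flag.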

score 0 · Answer 2 · answered Nov 28 '15 at 11:33

I think I found it:

gsub("<script>[^</script>^<script>]+</script>", "", txt)

SalimK

While this code may answer the question, it would be better to include some context, explaining how it works and when to use it. Code-only answers are not useful in the long run. – Bono Nov 28 '15 at 12:28
