1

I have a string like

txt<-"text text text <div><div><script>xxxx</script></div><script>yyyyy</script>text </div><script>zzzzzz</script>"

I want to delete all script tags and their content, so that the result is:

"text text text <div><div></div>text </div>"

I have tried

gsub("<script.*?>(.*)<\\/script>", "", txt)

Could you also recommend a good tutorial for quickly learning regular expressions in R?

Thanks in advance

SalimK

2 Answers

2

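Your first attempt, with greedy dot matching, was bound to fail: the greedy (.*) runs on to the last </script> in the string, so everything between the first <script> and the last </script> is removed, including the text you want to keep. Greedy matching does not give you the shortest match (and, strictly speaking, lazy matching does not guarantee it either).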
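As a minimal sketch of that failure, the question's pattern can be run against the sample string. Note that perl = TRUE is an assumption added here (the question used the default engine) so that the lazy .*? in the opening tag follows PCRE semantics; the names txt and greedy are only for illustration.

```r
txt <- "text text text <div><div><script>xxxx</script></div><script>yyyyy</script>text </div><script>zzzzzz</script>"

# Assumption: perl = TRUE is added for this sketch; the question used the default engine.
# The greedy (.*) extends to the LAST </script>, so the match swallows
# everything from the first <script> to the end of the string.
greedy <- gsub("<script.*?>(.*)</script>", "", txt, perl = TRUE)
print(greedy)
## [1] "text text text <div><div>"
```

The closing </div> tags and the trailing "text " are lost because they sit between the first opening tag and the last closing tag.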

The latest attempt, <script>[^</script>^<script>]+</script>, is not valid either: [^</script>^<script>]+ is a negated character class, so it matches one or more characters other than <, /, s, c, r, i, p, t, > and ^. Clearly that is not what you need.
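A small sketch of why the character-class version only appears to work: it succeeds when the script body happens to avoid the excluded characters, and fails as soon as the body contains one of them. The input two_scripts is a made-up example, not taken from the question.

```r
# Hypothetical input: the first body ("xxxx") avoids the excluded characters,
# the second body ("print") starts with "p", which the negated class rejects.
two_scripts <- "<script>xxxx</script><script>print</script>"

res <- gsub("<script>[^</script>^<script>]+</script>", "", two_scripts)
print(res)
## [1] "<script>print</script>"
```

Only the first tag is removed; the second one survives because the class cannot match any of s, c, r, i, p, t.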

Setting this particular problem aside, a regex can be used to remove large chunks from plain text files in general.

When we need to match a substring between two non-identical markers (or delimiters), we can use the unroll-the-loop technique with a Perl-like regex that supports lookaheads.

Here is working code, which should handle plain text of any size:
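To illustrate the general shape of the technique with a different pair of markers, here is a sketch that strips HTML comments. The pattern follows the form start, then [^c]*, then (?:c(?!rest-of-end)[^c]*)*, then end, where c is the first character of the closing marker. The helper name strip_comments and the sample input are assumptions made up for this example.

```r
# Hypothetical helper: removes <!-- ... --> blocks with an unrolled pattern.
# "-" is the first character of the closing marker "-->", so the loop
# consumes non-dash runs, and single dashes not followed by "->".
strip_comments <- function(x) {
  gsub("<!--[^-]*(?:-(?!->)[^-]*)*-->", "", x, perl = TRUE)
}

no_comments <- strip_comments("a<!-- x - y -- z -->b")
print(no_comments)
## [1] "ab"
```

Dashes inside the comment body are consumed one at a time by the loop, and the match stops only at the first dash that begins "-->".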

txt<-"text text text <div><div><script>xxxx</script></div><script>yyyyy</script>text </div><script>zzzzzz</script>"
gsub("<script\\b[^<]*>[^<]*(?:<(?!/script>)[^<]*)*</script>", "", txt, perl=TRUE)
## [1] "text text text <div><div></div>text </div>"

The regex demo can be seen here and here is the IDEONE demo.

Basically, that matches:

  • <script\\b[^<]*> - any opening <script> tag, with or without attributes (note that a raw < is not expected inside the tag's attributes, so [^<]* is safer to use here than [^<>]* or [^>]*)
  • [^<]*(?:<(?!/script>)[^<]*)* - the unrolled equivalent of the (?s).*? construct: it matches any text, including line breaks, up to the first </script>
  • </script> - closing </script> tag
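The three parts above can be exercised on inputs that the sample string does not cover: a tag with attributes, a body containing a literal <, and a body spanning several lines. This is only a sketch; the wrapper name strip_scripts and the test inputs are assumptions introduced for illustration.

```r
# Hypothetical wrapper around the pattern from this answer.
strip_scripts <- function(x) {
  gsub("<script\\b[^<]*>[^<]*(?:<(?!/script>)[^<]*)*</script>", "", x, perl = TRUE)
}

# Opening tag with an attribute, and a "<" inside the script body.
with_attr <- strip_scripts("a<script type=\"text/javascript\">if (a < b) { x(); }</script>b")
print(with_attr)
## [1] "ab"

# [^<] also matches line breaks, so multi-line bodies are removed too.
multi_line <- strip_scripts("a<script>\nline1\nline2\n</script>b")
print(multi_line)
## [1] "ab"
```

The pattern is case-sensitive as written, so tags such as <SCRIPT> would need ignore.case = TRUE or an inline (?i) flag.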
Wiktor Stribiżew
0

I think I found it:

gsub("<script>[^</script>^<script>]+</script>", "", txt)
SalimK
  • While this code may answer the question, it would be better to include some context, explaining how it works and when to use it. Code-only answers are not useful in the long run. – Bono Nov 28 '15 at 12:28