Your first attempt with greedy dot matching was doomed to fail: greedy matching does not ensure the shortest match (strictly speaking, lazy matching does not either), so it would match everything from the first <script> to the last </script>, including all the text in between that you need to keep.
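To make this concrete, here is a small sketch using the sample string from the working code further down. The greedy pattern and the nested string are my own illustrations, not your exact inputs:

```r
# Sample string taken from the working example in this answer
txt <- "text text text <div><div><script>xxxx</script></div><script>yyyyy</script>text </div><script>zzzzzz</script>"

# Greedy: .* runs on to the LAST </script>, so the markup between the blocks is lost
greedy <- gsub("<script>.*</script>", "", txt, perl = TRUE)
greedy
## [1] "text text text <div><div>"

# Lazy: .*? stops at the first </script>, but the match still starts at the
# leftmost <script>, so it is not necessarily the shortest possible match
nested <- "A<script>B<script>C</script>D"   # illustrative input
lazy_nested <- sub("<script>.*?</script>", "", nested, perl = TRUE)
lazy_nested
## [1] "AD"
```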
The latest attempt at using <script>[^</script>^<script>]+</script> is also not valid, as [^</script>^<script>]+ is a negated character class, not a negated sequence: it matches 1 or more characters other than <, /, s, c, r, i, p, t, > and ^. Clearly that is not what you need.
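A quick way to see what that bracket expression really does (the input strings are made up for illustration):

```r
cls <- "[^</script>^<script>]+"

# Remove everything the class matches; only the excluded characters survive
leftover <- gsub(cls, "", "var total = 1;", perl = TRUE)
leftover
## [1] "rtt"

bad <- "<script>[^</script>^<script>]+</script>"

# Matches only while the body avoids s, c, r, i, p, t, /, <, > and ^
ok_plain <- grepl(bad, "<script>xxxx</script>", perl = TRUE)
ok_plain
## [1] TRUE

# Fails as soon as the body starts with one of those characters
ok_code <- grepl(bad, "<script>print(1)</script>", perl = TRUE)
ok_code
## [1] FALSE
```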
Abstracting from the problem itself, it is possible to process plain text files of any kind with a regex that removes large chunks of text.
When we need to match a substring between two non-identical markers (or delimiters), we may use the unroll-the-loop technique with a Perl-like regex that supports lookaheads.
Here is code that should work with plain text of any size:
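As a rough sketch of the general shape, the unrolled pattern is: opening marker, then "anything but the first character of the closing marker", then repeatedly "that first character, provided it does not start the closing marker, followed by more of anything else". The comment delimiters below are my own illustration, not part of your task:

```r
# Opening marker <!--, closing marker -->
# [^-]*                 plain run with no "-"
# (?:-(?!->)[^-]*)*     a "-" that does not start "-->", then another plain run
unrolled <- "<!--[^-]*(?:-(?!->)[^-]*)*-->"

res <- gsub(unrolled, "", "a<!-- x - y -->b<!-- z -->c", perl = TRUE)
res
## [1] "abc"
```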
txt <- "text text text <div><div><script>xxxx</script></div><script>yyyyy</script>text </div><script>zzzzzz</script>"
gsub("<script\\b[^<]*>[^<]*(?:<(?!/script>)[^<]*)*</script>", "", txt, perl = TRUE)
## [1] "text text text <div><div></div>text </div>"
The regex demo can be seen here and here is the IDEONE demo.
Basically, that matches:
<script\\b[^<]*> - any opening <script> tag, even with attributes inside (note that < is not expected to appear in HTML attributes, thus [^<]* is safer to use than [^<>]* or [^>]*)
[^<]*(?:<(?!/script>)[^<]*)* - the unrolled (?s).*? construct, which matches any text up to the first </script>
</script> - the closing </script> tag
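If you want to convince yourself that the unrolled part behaves like (?s).*? here, a check along these lines may help. The helper name strip_scripts and the sample HTML are hypothetical, only the pattern comes from the answer:

```r
# Hypothetical wrapper around the pattern from this answer
strip_scripts <- function(x) {
  gsub("<script\\b[^<]*>[^<]*(?:<(?!/script>)[^<]*)*</script>", "", x, perl = TRUE)
}

# Made-up input: a tag with an attribute, a multi-line body and a bare "<" inside it
html <- "<p>keep</p><script type=\"text/javascript\">\nif (a < b) { f(); }\n</script><p>also keep</p>"

cleaned <- strip_scripts(html)
cleaned
## [1] "<p>keep</p><p>also keep</p>"

# The lazy dot-all equivalent gives the same result on this input
lazy <- gsub("(?s)<script\\b[^<]*>.*?</script>", "", html, perl = TRUE)
identical(cleaned, lazy)
## [1] TRUE
```

Both patterns should give the same output on input like this; the unrolled form is usually preferred on large texts because it avoids the step-by-step expansion that .*? performs.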