The point here is that some repeated substrings will themselves contain shorter repeated substrings. So, to match the longer ones, you would use
(\b.+\b)\1\b
(see the regex demo), and to find the shorter substrings, I'd rely on lazy dot matching:
(\b.+?\b)\1\b
See this regex demo. The replacement string is \1, a backreference to the part captured with the grouping construct (...).
You need a PCRE regex to make it work, since there are documented issues with matching multiple word boundaries with gsub (so, add the perl=TRUE argument). The R documentation warns:
POSIX 1003.2 mode of gsub and gregexpr does not work correctly with repeated word-boundaries (e.g., pattern = "\b"). Use perl = TRUE for such matches (but that may not work as expected with non-ASCII inputs, as the meaning of ‘word’ is system-dependent).
Note that if your repeated substrings can span multiple lines, you can use the PCRE regex with the DOTALL modifier (?s) at the start of the pattern (so that a . can also match a newline character).
So, the R code would look like
gsub("(?s)(\\b.+\\b)\\1\\b", "\\1", s, perl=TRUE)
or
gsub("(?s)(\\b.+?\\b)\\1\\b", "\\1", s, perl=TRUE)
See the IDEONE demo:
text <- "are blue and then and then more and then and then more very bright"
gsub("(?s)(\\b.+?\\b)\\1\\b", "\\1", text, perl=TRUE) ## shorter repeated substrings
## [1] "are blue and then more and then more very bright"
gsub("(?s)(\\b.+\\b)\\1\\b", "\\1", text, perl=TRUE) ## longer repeated substrings
## [1] "are blue and then and then more very bright"
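To illustrate what the (?s) modifier contributes, here is a minimal sketch with a made-up input in which the repeated part spans a line break. The sample string and the variable names s, with_dotall and without_dotall are assumptions for illustration only, not taken from the question. Without (?s), the dot cannot cross the newline, so the repeat should not be found:

```r
# Hypothetical sample input: the repeated chunk " one\ntwo" contains a newline
s <- "start one\ntwo one\ntwo end"

# With (?s), "." also matches "\n", so the repeated chunk can be captured
with_dotall <- gsub("(?s)(\\b.+\\b)\\1\\b", "\\1", s, perl = TRUE)

# Without (?s), "." stops at "\n", so the capture cannot span the line break
without_dotall <- gsub("(\\b.+\\b)\\1\\b", "\\1", s, perl = TRUE)

# Compare: the first call collapses the repeat, the second leaves s untouched
identical(with_dotall, "start one\ntwo end")
identical(without_dotall, s)
```

If your input never contains line breaks, (?s) makes no difference and can be dropped.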