My question is a direct extension of this earlier question about detecting consecutive repeated words (unigrams) in a string.

In the previous question,

Not that that is related

could be detected via this regex: \b(\w+)\s+\1\b

Here, I want to detect consecutive repeated bigrams (pairs of words):

are blue and then and then very bright

Ideally, I also want to know how to replace the detected pattern (duplicate) by a single element, so as to obtain in the end:

are blue and then very bright

(for this application, if it matters, I am using gsub in R)

Antoine
  • There can be edge cases here... What are your exact criteria? Try [`(\b.+\b)\1\b`](https://regex101.com/r/eF5tF2/3). `gsub("(\\b.+\\b)\\1\\b", "\\1", s, perl=T)`. – Wiktor Stribiżew Apr 20 '16 at 15:20
  • thank you for your interest in my question. What do you mean by `edge cases`? – Antoine Apr 20 '16 at 15:21
  • it seems that your proposed solution works well... By `edge cases` do you mean that in some situations it could behave unexpectedly? – Antoine Apr 20 '16 at 15:22
  • @WiktorStribiżew Your solution will not work [in all cases](https://regex101.com/r/lJ5wC6/1) – Kaspar Lee Apr 20 '16 at 15:23
  • @WiktorStribiżew [Still does not work...](https://regex101.com/r/jL7uI7/1) – Kaspar Lee Apr 20 '16 at 15:24
  • The thing is that if you have both longer and shorter repeating substrings, a lazy matching pattern can be used. It depends. See [`(\b.+?\b)\1\b`](https://regex101.com/r/eF5tF2/4) - check how the matches differ by removing and adding the question mark. Please clarify. – Wiktor Stribiżew Apr 20 '16 at 15:25
  • @WiktorStribiżew when I remove the question mark it captures longer repeated patterns. In my case I am only interested in detecting repeated bigrams, so I'll leave the question mark. – Antoine Apr 20 '16 at 15:29
  • One last comment: if your repeating substrings span multiple *lines* and you use `perl=T`, add `(?s)` at the start of the pattern. If you do not use `perl=T`, do not add `(?s)`, since in TRE `.` matches any character, including a newline. – Wiktor Stribiżew Apr 20 '16 at 15:38
  • @WiktorStribiżew ok thanks. I don't think my input text spans across multiple lines but I will check. I am leaving now but I will accept your answer when I come back – Antoine Apr 20 '16 at 15:49

2 Answers

Try the following RegEx:

(\b.+?\b)\1\b

The capture group matches text that starts at a word boundary and ends at a word boundary. The \1 refers to what was captured and requires the same text to appear again immediately afterwards. The pattern then checks for a word boundary at the end to prevent a and and z zoo from being selected.

As for the replacement, use \1. This holds the text from the 1st capture group (the first copy of the repeated bigram), and that single copy replaces the whole match.
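As a minimal sketch in R (which the question mentions using), the pattern and replacement can be applied with gsub as follows. The variable names `s` and `deduped` are illustrative only, and perl = TRUE is assumed so that PCRE handles the repeated \b:

```r
# Input string taken from the question.
s <- "are blue and then and then very bright"

# Lazy capture that starts and ends on a word boundary,
# immediately followed by a copy of itself (\\1) and a final boundary.
# perl = TRUE selects the PCRE engine.
deduped <- gsub("(\\b.+?\\b)\\1\\b", "\\1", s, perl = TRUE)

print(deduped)
## [1] "are blue and then very bright"
```

Note that the captured text here is " and then" with its leading space, which is why no extra whitespace is left behind after the replacement.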

Live Demo on Regex101

Kaspar Lee
  • @WiktorStribiżew you were the first to answer via your comments. The current answer is almost exactly the same thing you proposed (only difference I can see is `\b` at the end), so if you post an answer I will accept it since you were the first – Antoine Apr 20 '16 at 15:32
  • @WiktorStribiżew actually no, there is not even a difference between what you proposed in your comments and the current answer. So if you post an answer I will definitely accept it – Antoine Apr 20 '16 at 15:34
  • @Antoine: It is ok, I do not like posting duplicates of existing answers. – Wiktor Stribiżew Apr 20 '16 at 15:36
  • @WiktorStribiżew yes but it is even worse than leaving a thread unclosed. I would feel uncomfortable accepting another answer since you were the first. Too bad I cannot accept a comment as an answer – Antoine Apr 20 '16 at 15:37
  • @WiktorStribiżew You did answer first, I'll delete this if you want...? – Kaspar Lee Apr 20 '16 at 15:38
  • This is embarrassing. I have never been in such a situation. – Wiktor Stribiżew Apr 20 '16 at 15:40
  • guys, I am not an expert about SO rules but think it is fine to leave both contributions. However, since @WiktorStribiżew answered first I will accept his answer if he provides one. – Antoine Apr 20 '16 at 15:40
  • Ok, I will write a comprehensive R-related answer. No need to delete this one. – Wiktor Stribiżew Apr 20 '16 at 15:40
  • great I think this settles it – Antoine Apr 20 '16 at 15:41

The point here is that in some cases, there will be repeating substrings that include shorter repeated substrings. So, to match the longer ones, you would use

(\b.+\b)\1\b

(see the regex demo), and to find the shorter ones, I'd rely on lazy dot matching:

(\b.+?\b)\1\b

See this regex demo. The replacement string will be \1 - the backreference to the captured part matched first with the grouping construct (...).

You need a PCRE regex for this to work, since gsub in its default mode has documented issues with matching repeated word boundaries (so add the perl=T argument). As the gsub documentation warns:

POSIX 1003.2 mode of gsub and gregexpr does not work correctly with repeated word-boundaries (e.g., pattern = "\b"). Use perl = TRUE for such matches (but that may not work as expected with non-ASCII inputs, as the meaning of ‘word’ is system-dependent).

Note that if your repeated substrings can span multiple lines, you can use the PCRE regex with the DOTALL modifier (?s) at the start of the pattern (so that a . can also match a newline symbol).

So, the R code would look like

gsub("(?s)(\\b.+\\b)\\1\\b", "\\1", s, perl=T)

or

gsub("(?s)(\\b.+?\\b)\\1\\b", "\\1", s, perl=T)
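To illustrate what the (?s) modifier changes, here is a small sketch with a made-up input in which the repeated substring contains a newline. The string `multi` and the result names are hypothetical and chosen only for this demonstration:

```r
# Hypothetical input: the repeated chunk " and\nthen" spans a line break.
multi <- "x and\nthen and\nthen y"

# With (?s), the dot also matches "\n", so the repeat across lines is found.
with_dotall <- gsub("(?s)(\\b.+?\\b)\\1\\b", "\\1", multi, perl = TRUE)

# Without (?s), the PCRE dot stops at "\n", so nothing is matched here.
without_dotall <- gsub("(\\b.+?\\b)\\1\\b", "\\1", multi, perl = TRUE)

print(with_dotall == "x and\nthen y")
## [1] TRUE
print(without_dotall == multi)
## [1] TRUE
```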

See the IDEONE demo:

text <- "are blue and then and then more and then and then more very bright"
gsub("(?s)(\\b.+?\\b)\\1\\b", "\\1", text, perl=T) ## shorter repeated substrings
## [1] "are blue and then more and then more very bright"
gsub("(?s)(\\b.+\\b)\\1\\b", "\\1", text, perl=T) ## longer repeated substrings
## [1] "are blue and then and then more very bright"
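Both patterns collapse any immediately repeated run that starts and ends on a word boundary, which includes a single repeated word. If only repeated bigrams should be touched, one possible variant is to extend the unigram regex from the question to two words. This is a sketch under that assumption, not a pattern taken from the linked demos, and the input string and variable names are made up for illustration:

```r
# Hypothetical input with a repeated unigram ("very very")
# and a repeated bigram ("and then and then").
s2 <- "very very bright and then and then"

# Bigram-only variant: exactly two \w+ tokens, whitespace, then the same two tokens.
bigram_only <- gsub("\\b(\\w+\\s+\\w+)\\s+\\1\\b", "\\1", s2, perl = TRUE)

# The lazy pattern, for comparison, also collapses the repeated single word.
lazy_any <- gsub("(\\b.+?\\b)\\1\\b", "\\1", s2, perl = TRUE)

print(bigram_only)
## [1] "very very bright and then"
print(lazy_any)
## [1] "very bright and then"
```

The bigram-only variant relies on \w, so the caveat about non-ASCII input quoted above may apply to it as well.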
Wiktor Stribiżew