How to say (\w+\W+) times 4 in regex (R gsub)

Question

In R's gsub(), I need to capture the four words occurring after a particular phrase in a longer string. Building on the wisdom offered here, I came up with: ^.*\\b(particular phrase)\\W+(\\w+\\W+\\w+\\W+\\w+\\W+\\w+).*$

For example:

this_txt <- "Blah blah particular phrase Extract These Words Please for the blah blah. Ignore blah this other stuff blah blah, blah."
this_pattern <- "^.*\\b(particular phrase)\\W+(\\w+\\W+\\w+\\W+\\w+\\W+\\w+).*$"
gsub(this_pattern, "\\2", this_txt, ignore.case = TRUE)
# [1] "Extract These Words Please"

But the repetition of \\w+\\W+ in the pattern is pretty unseemly. Surely there is a better way. I thought ^.*\\b(particular phrase)\\W+(\\w+\\W+){4}.*$ might work, but it doesn't.

[It does work](https://regex101.com/r/soTiU3/1), but does end with a `\W`. You just needed to capture with another group and change the inner to a non capturing group. — bobble bubble, Jun 12 '19 at 16:51
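To make the comment above concrete, here is a minimal sketch comparing the two groupings on the question's sample text. The variable names `repeated_group`, `wrapped_group`, `last_only`, and `all_four` are made up for illustration, and `perl = TRUE` is an assumption added here so the capture behavior follows PCRE rules; the original question used the default engine.

```r
this_txt <- "Blah blah particular phrase Extract These Words Please for the blah blah. Ignore blah this other stuff blah blah, blah."

# Quantified capturing group: each repetition overwrites the previous capture,
# so \\2 ends up holding only the fourth word and its trailing non-word chars.
repeated_group <- "^.*\\b(particular phrase)\\W+(\\w+\\W+){4}.*$"
last_only <- gsub(repeated_group, "\\2", this_txt, ignore.case = TRUE, perl = TRUE)

# Outer capturing group around a non-capturing repetition: \\2 now spans
# all four repetitions, including the non-word chars after the last word.
wrapped_group <- "^.*\\b(particular phrase)\\W+((?:\\w+\\W+){4}).*$"
all_four <- gsub(wrapped_group, "\\2", this_txt, ignore.case = TRUE, perl = TRUE)

print(last_only)
print(all_four)
```

The first call should return only the last word with its trailing space, while the second returns all four words, still followed by that trailing space.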

score 3 · Accepted Answer · answered Jun 12 '19 at 16:36

You may use

^.*\b(particular phrase)\W+((?:\w+\W+){3}\w+).*$

In R,

this_pattern <- "^.*\\b(particular phrase)\\W+((?:\\w+\\W+){3}\\w+).*$"
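As a quick check, the pattern can be run against the sample text from the question. This is only a sketch; `four_words` is a name chosen here for illustration.

```r
this_txt <- "Blah blah particular phrase Extract These Words Please for the blah blah. Ignore blah this other stuff blah blah, blah."
this_pattern <- "^.*\\b(particular phrase)\\W+((?:\\w+\\W+){3}\\w+).*$"

# \\2 refers to the outer capturing group, i.e. the four words after the phrase.
four_words <- gsub(this_pattern, "\\2", this_txt, ignore.case = TRUE)
print(four_words)
```

Because the group ends with \\w+ rather than \\W+, the result should carry no trailing whitespace or punctuation.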

See the regex demo

(\w+\W+\w+\W+\w+\W+\w+) is replaced with ((?:\w+\W+){3}\w+). The ((?:\w+\W+){3}\w+) is a capturing group ((...)) that contains two subpatterns:

- (?:\w+\W+){3} - a non-capturing group matching three repetitions of
  - \w+ - 1 or more word chars
  - \W+ - 1 or more non-word chars
- \w+ - 1 or more word chars.
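The same structure generalizes to any number of words: repeat the non-capturing group n - 1 times and finish with a final \w+. Below is a rough sketch of that idea. The helper `words_after()` and its arguments are hypothetical, not part of the original answer, and `perl = TRUE` is an assumption made here. The phrase is pasted into the pattern unescaped, so it is treated as a regex itself.

```r
# Hypothetical helper: return the n words that follow `phrase` in `txt`.
# Assumes `phrase` contains no regex metacharacters that need escaping.
words_after <- function(txt, phrase, n = 4) {
  stopifnot(n >= 1)
  # (n - 1) repetitions of "word + separator", then one final word.
  pattern <- sprintf("^.*\\b(%s)\\W+((?:\\w+\\W+){%d}\\w+).*$",
                     phrase, as.integer(n - 1))
  # The pattern is anchored at both ends, so sub() is enough here.
  sub(pattern, "\\2", txt, ignore.case = TRUE, perl = TRUE)
}

this_txt <- "Blah blah particular phrase Extract These Words Please for the blah blah. Ignore blah this other stuff blah blah, blah."
print(words_after(this_txt, "particular phrase"))
print(words_after(this_txt, "particular phrase", n = 2))
```

Note that when the pattern does not match, sub() returns the input unchanged, so a caller may want to test with grepl() first.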
