In R's gsub(), I need to capture the four words that follow a particular phrase in a longer string. Building on the wisdom offered here, I came up with: ^.*\\b(particular phrase)\\W+(\\w+\\W+\\w+\\W+\\w+\\W+\\w+).*$

For example:

this_txt <- "Blah blah particular phrase Extract These Words Please for the blah blah. Ignore blah this other stuff blah blah, blah."
this_pattern <- "^.*\\b(particular phrase)\\W+(\\w+\\W+\\w+\\W+\\w+\\W+\\w+).*$"
gsub(this_pattern, "\\2", this_txt, ignore.case = TRUE)
# [1] "Extract These Words Please"

But the repetition of \\w+\\W+ in the pattern is pretty unseemly. Surely there is a better way. I thought ^.*\\b(particular phrase)\\W+(\\w+\\W+){4}.*$ might work, but it doesn't.

ben
  • [It does work](https://regex101.com/r/soTiU3/1), but does end with a `\W`. You just needed to capture with another group and change the inner to a non capturing group. – bobble bubble Jun 12 '19 at 16:51
  • `.*\\bparticular phrase((?:\\W+\\w+){4}).*` should work – Onyambu Jun 12 '19 at 17:10

1 Answer

You may use

^.*\b(particular phrase)\W+((?:\w+\W+){3}\w+).*$

In R,

this_pattern <- "^.*\\b(particular phrase)\\W+((?:\\w+\\W+){3}\\w+).*$"
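
As a quick check, the pattern can be applied to the sample string from the question. This is only a minimal sketch: `perl = TRUE` is an assumption added here so that the `(?:...)` group is handled by PCRE, and `result` is just an illustrative variable name.

```r
# Sample text taken from the question
this_txt <- "Blah blah particular phrase Extract These Words Please for the blah blah. Ignore blah this other stuff blah blah, blah."

# Group 1: the anchor phrase; group 2: three "word + separator" repetitions plus a final word
this_pattern <- "^.*\\b(particular phrase)\\W+((?:\\w+\\W+){3}\\w+).*$"

# Replace the whole string with the contents of group 2
# (perl = TRUE is an assumption of this sketch, not part of the original answer)
result <- gsub(this_pattern, "\\2", this_txt, ignore.case = TRUE, perl = TRUE)
print(result)
# [1] "Extract These Words Please"
```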

See the regex demo

Here, (\w+\W+\w+\W+\w+\W+\w+) is replaced with ((?:\w+\W+){3}\w+). The new ((?:\w+\W+){3}\w+) is a capturing group, (...), that contains two subpatterns:

  • (?:\w+\W+){3} - a non-capturing group matching three repetitions of
    • \w+ - 1 or more word chars
    • \W+ - 1 or more non-word chars
  • \w+ - 1 or more word chars.
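
To illustrate why the outer capturing group matters, the sketch below contrasts the `(\\w+\\W+){4}` attempt from the question with a version wrapped in an extra group. A quantified capturing group keeps only its last repetition, so the first call returns just the fourth word. The names `last_only` and `all_four` are illustrative, and `perl = TRUE` is again an assumption of this sketch.

```r
this_txt <- "Blah blah particular phrase Extract These Words Please for the blah blah. Ignore blah this other stuff blah blah, blah."

# Quantified capturing group: group 2 is overwritten on each repetition,
# so only the fourth "word + separator" chunk survives
naive_pattern <- "^.*\\b(particular phrase)\\W+(\\w+\\W+){4}.*$"
last_only <- gsub(naive_pattern, "\\2", this_txt, ignore.case = TRUE, perl = TRUE)
print(last_only)
# [1] "Please "

# Outer capturing group around a non-capturing repetition: group 2 spans all
# four chunks, including the trailing separator mentioned in the comments
wrapped_pattern <- "^.*\\b(particular phrase)\\W+((?:\\w+\\W+){4}).*$"
all_four <- gsub(wrapped_pattern, "\\2", this_txt, ignore.case = TRUE, perl = TRUE)
print(all_four)
# [1] "Extract These Words Please "
```

Using `{3}` followed by a final `\\w+`, as in the answer, avoids the trailing separator without needing a separate trimming step.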
Wiktor Stribiżew