
I have strings like these:

test <- c("oh i mean well i do n't know well he 's like oh",
          "yeah so well he did n't say oh he said f** well you know what he 's like",
          "oh you know well why well maybe he thought oh well good", 
          "oh my god well what the hell did he oh you know")

I'd like to match all word sequences that start with oh and end with well and, inversely, all that start with well and end with oh. The str_extract_all call below finds some of the target sequences but not all of them, because its matches cannot overlap: once an oh or well has been consumed as the end of one match, it cannot also serve as the start of the next:

library(stringr)
strings <- unlist(str_extract_all(test, "\\boh\\b.*?\\bwell\\b|\\bwell\\b.*?\\boh\\b"))
[1] "oh i mean well"           "well he 's like oh"       "well he did n't say oh"   "oh you know well"        
[5] "well maybe he thought oh" "oh my god well" 

The complete output would be this:

[1] "oh i mean well"     "well he 's like oh"     "well he did n't say oh"     "oh he said f** well" 
[5] "oh you know well"  "oh well"   "well maybe he thought oh"     "oh my god well"
[9] "well what the hell did he oh" 
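
To see the non-overlapping behaviour in isolation, here is a minimal sketch on a toy string (the string `x` and the variable names are made up for illustration and are not part of the data above). The `well` in the middle should both end one sequence and start the next, but only the first sequence is returned:

```r
library(stringr)

# Toy string: "well" should end "oh a well" and also start "well b oh"
x <- "oh a well b oh"

# Same alternation as in the question
pattern <- "\\boh\\b.*?\\bwell\\b|\\bwell\\b.*?\\boh\\b"

# Matching resumes after the consumed "well", so "well b oh" is never found
found <- str_extract_all(x, pattern)[[1]]
print(found)
# [1] "oh a well"
```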
Chris Ruehlemann

2 Answers


You can call str_extract_all twice, once for oh...well and once for well...oh, using your regex:

library(stringr)
unlist(c(str_extract_all(test, "\\boh\\b.*?\\bwell\\b")
       , str_extract_all(test, "\\bwell\\b.*?\\boh\\b")))
#[1] "oh i mean well"                       
#[2] "oh he said f** well"                  
#[3] "oh you know well"                     
#[4] "oh well"                              
#[5] "oh my god well"                       
#[6] "well i do n't know well he 's like oh"
#[7] "well he did n't say oh"               
#[8] "well why well maybe he thought oh"    
#[9] "well what the hell did he oh"         

Or, if you want only the shortest sequences (so that a match never contains a second occurrence of its starting word):

unlist(c(str_extract_all(test, "\\boh\\b((?!\\boh\\b).)*?\\bwell\\b")
 , str_extract_all(test, "\\bwell\\b((?!\\bwell\\b).)*?\\boh\\b")))
#[1] "oh i mean well"               "oh he said f** well"         
#[3] "oh you know well"             "oh well"                     
#[5] "oh my god well"               "well he 's like oh"          
#[7] "well he did n't say oh"       "well maybe he thought oh"    
#[9] "well what the hell did he oh"
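
The difference between the two variants can be sketched on a toy string (the string `x` and the names `plain` and `tempered` are assumptions for illustration only). The plain lazy pattern starts at the first `well` and runs to the first `oh`, while the version with the negative lookahead refuses to step over a second `well`, so it starts from the last one:

```r
library(stringr)

# Toy string with two "well" before a single "oh"
x <- "well a well b oh"

# Plain lazy match: begins at the first "well"
plain <- str_extract_all(x, "\\bwell\\b.*?\\boh\\b")[[1]]

# Each consumed character is first checked not to start another "well"
tempered <- str_extract_all(x, "\\bwell\\b((?!\\bwell\\b).)*?\\boh\\b")[[1]]

print(plain)
# [1] "well a well b oh"
print(tempered)
# [1] "well b oh"
```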

Data:

test <- c("oh i mean well i do n't know well he 's like oh",
          "yeah so well he did n't say oh he said f** well you know what he 's like",
          "oh you know well why well maybe he thought oh well good", 
          "oh my god well what the hell did he oh you know")
GKi
  • Thanks for your solution. I've never been so torn as to which answer to accept but have ultimately decided to accept @Wiktor's answer as it provides yet more insight into regex. – Chris Ruehlemann Jul 08 '20 at 13:18

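You may use a stringr::str_match_all solution with the whole pattern wrapped in a lookahead, so that matches can overlap. The text is captured by a group inside the lookahead, which is why str_match_all is needed: stringr::str_extract_all returns only the whole match (empty here) and "loses" all captured substrings: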

test <- c("oh i mean well i do n't know well he 's like oh",
"yeah so well he did n't say oh he said f** well you know what he 's like", 
"oh you know well why well maybe he thought oh well good",
"oh my god well what the hell did he oh you know")
res <- stringr::str_match_all(test, "(?=(\\boh\\b(?:(?!\\boh\\b).)*?\\bwell\\b|\\bwell\\b(?:(?!\\bwell\\b).)*?\\boh\\b))")

unlist(lapply(res, function(x) x[,-1]))

See an R demo online and the regex demo.

Details

  • (?= - start of a positive lookahead:
    • ( - start of a capturing group:
      • \boh\b(?:(?!\boh\b).)*?\bwell\b - the whole word oh, then any 0+ chars, as few as possible, none of which starts another whole word oh, up to the leftmost whole word well
      • | - or
      • \bwell\b(?:(?!\bwell\b).)*?\boh\b - the whole word well, then any 0+ chars, as few as possible, none of which starts another whole word well, up to the leftmost whole word oh
    • ) - end of the capturing group
  • ) - end of the positive lookahead.
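
The effect of the lookahead can be sketched on a toy string (the string `x` and the variable names are assumptions for illustration, not part of the original data). Because the lookahead itself consumes no text, the `well` that ends the first sequence is still available to start the second one, and the capturing group carries the matched text:

```r
library(stringr)

# Toy string: "well" ends one sequence and starts the next
x <- "oh a well b oh"

pattern <- "(?=(\\boh\\b(?:(?!\\boh\\b).)*?\\bwell\\b|\\bwell\\b(?:(?!\\bwell\\b).)*?\\boh\\b))"

# Column 1 is the (empty) whole match, column 2 is the capturing group
m <- str_match_all(x, pattern)[[1]]
overlapping <- m[, 2]
print(overlapping)
# [1] "oh a well" "well b oh"
```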

Output:

[1] "oh i mean well"               "well he 's like oh"          
[3] "well he did n't say oh"       "oh he said f** well"         
[5] "oh you know well"             "well maybe he thought oh"    
[7] "oh well"                      "oh my god well"              
[9] "well what the hell did he oh"
Wiktor Stribiżew
  • Brilliant! Thanks a lot. One question if you don't mind: I never know what the meaning is of `(?:` - what is it? – Chris Ruehlemann Jul 08 '20 at 12:07
  • @ChrisRuehlemann `(?:...)` is a **[non-capturing group](https://stackoverflow.com/questions/3512471/)** used to group a sequence of patterns (to use alternatives, or quantify it) without storing the text captured in a separate memory slot. – Wiktor Stribiżew Jul 08 '20 at 12:09
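
As a small illustration of the comment above (a sketch on a made-up string; `x`, `capturing` and `non_capturing` are hypothetical names), stringr::str_match returns one column for the whole match plus one per capturing group, so a `(?:...)` group adds no column:

```r
library(stringr)

x <- "well well oh"

# Two capturing groups: columns are whole match, group 1, group 2
capturing <- str_match(x, "(well )+(oh)")

# (?:...) still groups "well " for the + quantifier but stores nothing
non_capturing <- str_match(x, "(?:well )+(oh)")

ncol(capturing)      # 3
ncol(non_capturing)  # 2
```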