Extracting specific parts of an input string using the stringr package in R

Question

Basically, this is my input:

"a ~ b c d*e !r x"
"a ~ b c"
"a ~ b c d1 !r y"
"a ~ b c D !r z"
"a~b c d*e!r z"

and I would like this as my result:

"b c d*e"
"b c"
"b c d1"
"b c D"
"b c d*e"

The input represents (mixed) models that are built up of three groups: the dependent part (before the ~), the fixed part, and the random part (after !r). I thought this would be easy enough with capture groups (example). The difficulty is that the random part is not always present.

I tried different things, as you can see below, and of course it is possible to do this in two steps. However, I would like a (robust) regex one-liner; I feel that should be possible. I also used these sources for inspiration: non-capturing groups, string replacing and string removal.

library(stringr)
txt <- c("a ~ b c d*e !r x",
         "a ~ b c",
         "a ~ b c d1 !r y",
         "a ~ b c D !r z",
         "a~b c d*e!r z")

# Different tries with capture groups
str_replace(txt, "^.*~ (.*) !r.*$", "\\1")
> [1] "b c d*e"       "a ~ b c"       "b c d1"        "b c D"        
> [5] "a~b c d*e!r z"
str_replace(txt, "^(.*~ )(.*)( !r.*)$", "\\2")
> [1] "b c d*e"       "a ~ b c"       "b c d1"        "b c D"        
> [5] "a~b c d*e!r z"
str_replace(txt, "^(.*~)(.*)(!r.*|\n)$", "\\1\\2")
> [1] "a ~ b c d*e " "a ~ b c"      "a ~ b c d1 "  "a ~ b c D "  
> [5] "a~b c d*e"
str_replace(txt, "^(.*) ~ (.*)!r.*($)", "\\2")
> [1] "b c d*e "      "a ~ b c"       "b c d1 "       "b c D "       
> [5] "a~b c d*e!r z"
str_replace(txt, "^.* ~ (.*)(!r.*|\n)$", "\\1")
> [1] "b c d*e "      "a ~ b c"       "b c d1 "       "b c D "       
> [5] "a~b c d*e!r z"


# Multiple steps
step1 <- str_replace(txt, "^.*~\\s*", "")
step2 <- str_replace(step1, "\\s*!r.*$", "")
step2
> "b c d*e" "b c"     "b c d1"  "b c D"   "b c d*e"

EDIT: After posting I kept playing around and found something that worked for my particular case.

# My (probably non-robust) solution/monstrosity
str_replace(txt, "(^.*~\\s*(.*)\\s*!r.*$|^.*~\\s*(.*)$)", "\\2\\3")
> "b c d*e " "b c"      "b c d1 "  "b c D "   "b c d*e"

sindri_baldur · Answer 1 · 2018-08-03T15:45:06.427

3

What about str_extract() using a positive lookbehind?

str_extract(st, "(?<=~)[^!]+") %>% trimws()
[1] "b c d*e" "b c"     "b c d1"  "b c D"   "b c d*e"

My try to rephrase in English:

We are looking for something that is preceded by a ~ (?<=~) and is a sequence of one or more characters that are not ! ([^!]+). Once we have found something that fits these criteria, we stop searching that string (otherwise use str_extract_all()). Finally, if what we extracted has any spaces at the start or end of the string, remove them with trimws().
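For readers without stringr at hand, the same idea can be sketched in base R. This is only an illustration of the lookbehind described above, not part of the original answer: `regexpr()` with `perl = TRUE` is used as a stand-in for `str_extract()`, and the `tricky` input is a made-up case showing that `[^!]+` stops at any `!`, not only at `!r`.

```r
st <- c("a ~ b c d*e !r x",
        "a ~ b c",
        "a ~ b c d1 !r y",
        "a ~ b c D !r z",
        "a~b c d*e!r z")

# regexpr() locates the first match in each string;
# perl = TRUE is needed for the lookbehind (?<=~).
m <- regexpr("(?<=~)[^!]+", st, perl = TRUE)
extracted <- trimws(regmatches(st, m))
print(extracted)

# Hypothetical input (not from the question): the match ends at the first "!",
# so everything after "!x" is lost even though it is not the random part.
tricky <- "a ~ b !x c !r z"
tricky_result <- trimws(regmatches(tricky, regexpr("(?<=~)[^!]+", tricky, perl = TRUE)))
print(tricky_result)
```

If a lone `!` can occur in the fixed part of real inputs, a pattern that looks for `!r` specifically is probably the safer choice.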

Data:

st <- c(
  'a ~ b c d*e !r x',
  'a ~ b c',
  'a ~ b c d1 !r y',
  'a ~ b c D !r z',
  'a~b c d*e!r z'
)

EDIT

A few updates already as the example inputs grew. I will not update again.

edited Aug 03 '18 at 15:45

answered Aug 03 '18 at 14:48


Interesting, will play around with this. After posting I came up with my own regex monstrosity that seems to work (also works on more cases); `str_replace(st, "(^.*~\\s*(.*)\\s*!r.*$|^.*~\\s*(.*)$)", "\\2\\3")`. It's a shame that with the first string there is an extra space at the end. Nothing `str_trim` can't handle, but still... – tstev Aug 03 '18 at 15:00
you mind if I throw some more cases at your solution that might "break" it? – tstev Aug 03 '18 at 15:09
Sure throw them in - but better that you asked. – sindri_baldur Aug 03 '18 at 15:09
My question is an oversimplification of my actual problem. For example, I used `a` and `b`, but in actual fact these can also be `b1` or `X`. So additional cases like `st – tstev Aug 03 '18 at 15:14
Thanks! I am trying to understand these positive/negative look ahead/behinds ..not very intuitive .. for me at least ..for now.. – tstev Aug 03 '18 at 15:34
1

@tstev Made another one and a final update. The current one only has a lookbehind which is `(?<=~)` and means *has to be preceded by `~`*. – sindri_baldur Aug 03 '18 at 15:37
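On the trailing space mentioned in the comments above: it appears because the greedy `(.*)` also swallows the whitespace in front of `!r`, which leaves nothing for the following `\\s*` to match. One possible way around it, sketched here in base R and not taken from any of the posts, is a lazy group followed by an optional random part. Using `perl = TRUE` is an assumption made so that the lazy quantifier follows PCRE rules.

```r
txt <- c("a ~ b c d*e !r x",
         "a ~ b c",
         "a ~ b c d1 !r y",
         "a ~ b c D !r z",
         "a~b c d*e!r z")

# (.*?) is lazy: it grows only until the rest of the pattern can match,
# i.e. optional whitespace, an optional "!r..." tail, and the end of the string.
fixed_part <- sub("^.*~\\s*(.*?)\\s*(!r.*)?$", "\\1", txt, perl = TRUE)
print(fixed_part)
```

Because the `!r` tail is optional, strings without a random part go through the same single pattern and no second alternative is needed.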

score 3 · Accepted Answer · answered Aug 03 '18 at 16:10

I suggest removing everything from the start up to and including the first tilde (with optional whitespace), and everything starting from the first !r as a whole word:

gsub("^[^~]+~\\s*|\\s*!r\\b.*", "", txt)

See the regex demo

Details

^ - start of string
[^~]+ - 1+ chars other than ~
~ - a ~ char
\\s* - 0+ whitespaces
| - or
\\s* - 0+ whitespaces
!r - !r substring
\\b - word boundary
.* - the rest of the string.
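To see what each side of the `|` contributes, the two alternatives listed above can be applied one at a time. This is only an illustrative decomposition; the variable names `head_removed`, `tail_removed` and `one_step` are made up for the sketch.

```r
txt <- c("a ~ b c d*e !r x",
         "a ~ b c",
         "a ~ b c d1 !r y",
         "a ~ b c D !r z",
         "a~b c d*e!r z")

# First alternative: strip everything up to and including the first "~"
# plus any whitespace that follows it.
head_removed <- sub("^[^~]+~\\s*", "", txt)
print(head_removed)

# Second alternative: strip optional whitespace, "!r" as a whole word,
# and the rest of the string.
tail_removed <- sub("\\s*!r\\b.*", "", head_removed)
print(tail_removed)

# The combined pattern with gsub() does both removals in one call.
one_step <- gsub("^[^~]+~\\s*|\\s*!r\\b.*", "", txt)
print(identical(tail_removed, one_step))
```

`gsub()` rather than `sub()` is needed for the combined pattern because each string can contain up to two matches, one per alternative.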

R demo:

txt <- c("a ~ b c d*e !r x",
         "a ~ b c",
         "a ~ b c d1 !r y",
         "a ~ b c D !r z",
         "a~b c d*e!r z")
gsub("^[^~]+~\\s*|\\s*!r\\b.*", "", txt)
## => [1] "b c d*e" "b c"     "b c d1"  "b c D"   "b c d*e"
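The word boundary only matters when `!r` is directly followed by another word character. A small sketch with a made-up input (the term `!rand` is hypothetical and does not come from the question) shows the difference:

```r
# Hypothetical input: "!rand" starts with "!r" but is not the "!r" marker itself.
extra <- "a ~ b !rand c !r z"

# With \\b, "!rand" is skipped because "r" is followed by a word character,
# so only the real " !r z" tail is removed.
with_boundary <- gsub("^[^~]+~\\s*|\\s*!r\\b.*", "", extra)
print(with_boundary)

# Without \\b, the first "!r" already matches and the rest of the string is cut.
without_boundary <- gsub("^[^~]+~\\s*|\\s*!r.*", "", extra)
print(without_boundary)
```

Whether such terms can actually occur depends on the model syntax in use, so the boundary is best read as a cheap safeguard.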

I ended up using this for my final solution. Hence, this was chosen as answer. — tstev, Aug 06 '18 at 12:04

Michał Turczyn · Answer 3 · 2018-08-03T16:18:32.527

1

This pattern lets you extract the text you want with the first capturing group: ~ ?([\w\*\-\+\/ ]+)(!r)?.

First capturing group: [\w\*\-\+\/ ]+ matches any word character \w, or *, +, -, / or a space, one or more times (+). It is terminated before the second capturing group (!r)?, if present.
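One way the pattern might be used from R is to pull out the first capturing group directly instead of replacing text. The sketch below uses base R's `regexec()` and `regmatches()` with `perl = TRUE` (an assumption, chosen so that `\w` is interpreted by PCRE); the backslashes are doubled because the pattern is written as an R string. Since the character class includes the space, the group can end with a trailing space, which `trimws()` removes.

```r
txt <- c("a ~ b c d*e !r x",
         "a ~ b c",
         "a ~ b c d1 !r y",
         "a ~ b c D !r z",
         "a~b c d*e!r z")

pattern <- "~ ?([\\w\\*\\-\\+\\/ ]+)(!r)?"

# regexec() records the positions of the full match and of every capturing group;
# regmatches() turns them into one character vector per input string.
matches <- regmatches(txt, regexec(pattern, txt, perl = TRUE))

# Element 1 is the full match, element 2 is the first capturing group.
group1 <- trimws(vapply(matches, function(m) m[2], character(1)))
print(group1)
```

This sidesteps the issue raised in the comment below, because nothing outside the capturing group has to be matched and replaced.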

Demo

edited Aug 03 '18 at 16:18

answered Aug 03 '18 at 15:51


Thanks for the explanation! However I can't seem to get this to work in `R` with the `stringr` package. i.e. it didn't remove the characters before the `~` or after the `!r` so I edited to: `str_replace(txt, ".*~ ?([\\w\\*\\-\\+\\/ ]*)(!r.*)?", "\\1")` and this seems to work for my cases. Perhaps you meant to use in a different way? – tstev Aug 06 '18 at 08:02
