I would like to extract the last set of digits from a string without chaining calls like this:

"sdkjfn45sdjk54()ad"

str_remove("sdkjfn45sdjk54()ad","[:alpha:]+$")
[1] "sdkjfn45sdjk54()"

str_remove(str_remove("sdkjfn45sdjk54()ad","[:alpha:]+$"), "\\(")
[1] "sdkjfn45sdjk54)"

str_remove(str_remove(str_remove("sdkjfn45sdjk54()ad","[:alpha:]+$"), "\\("), "\\)")
[1] "sdkjfn45sdjk54"

str_extract(str_remove(str_remove(str_remove("sdkjfn45sdjk54()ad","[:alpha:]+$"), "\\("), "\\)"), "\\d+$")
[1] "54"

I want to avoid this because the patterns are uncertain. I am aware that stringi has stri_extract_last_regex, but I need to stick to base R or stringr.

Thanks!

2 Answers

You can use a regex with a negative lookahead.

string <- "sdkjfn45sdjk54()ad"
stringr::str_extract(string, '(\\d+)(?!.*\\d)')
#[1] "54"

The same regex works in base R:

regmatches(string, gregexpr('(\\d+)(?!.*\\d)', string, perl = TRUE))[[1]]
#[1] "54"

This extracts the run of digits that is not followed by any other digit, i.e. the last set of digits in the string.
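To apply the same lookahead to a whole vector, including strings that contain no digits, something like the following base R sketch may help. The helper name `last_digits` is made up for illustration and is not part of base R or stringr. It also assumes the strings contain no embedded newlines, since `.` does not cross a newline by default.

```r
# Sketch: digits that are not followed by any later digit, vectorised.
# `last_digits` is a hypothetical helper name, not an existing function.
last_digits <- function(x) {
  # regexpr() returns the first match per string; under the no-newline
  # assumption the pattern can match at most once per string anyway.
  m <- regexpr("\\d+(?!.*\\d)", x, perl = TRUE)
  out <- rep(NA_character_, length(x))  # NA where no digits are found
  out[m != -1] <- regmatches(x, m)      # regmatches() drops non-matches
  out
}

last_digits(c("sdkjfn45sdjk54()ad", "abc", "7up"))
```

The strings without a match keep their `NA` placeholder, so the result always has the same length as the input.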

Ronak Shah
  • Thanks Ronak! Could you walk me through the regex for this - '(\\d+)(?!.*\\d)'? – Jantje Houten Apr 20 '21 at 13:34
  • `(\\d+)` is one or more digits. `(?!.*\\d)` is a negative look-ahead: the match succeeds only if the text after those digits does not match `.*\\d`, that is, no further digit appears later in the string. The look-ahead is checked but consumes no characters, so it is not part of the extracted match, and no trailing `$` is needed. A good reference for regex: https://stackoverflow.com/a/22944075/3358272. Note that it covers generic regex, and R requires double backslashes wherever that guide uses a single backslash. – r2evans Apr 20 '21 at 13:40

Use str_extract_all and grab just the last one in each vector.

library(stringr)
quux <- str_extract_all(c("a", "sdkjfn45sdjk54()ad"), "[0-9]+")
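# take the last match in each element, or NA when nothing matched
sapply(quux, function(x) if (length(x)) x[length(x)] else NA_character_)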
# [1] NA   "54"

I use sapply because I'm guessing that you have more than one string. str_extract_all will return a list, where each element is zero or more strings extracted from the source. Since we're only interested in one of those, we can use sapply.

One might be tempted to use sapply(., tail, 1), but if zero are found, then it will be character(0), not empty or NA. I'm inferring that NA would be a good return when the pattern is not found.
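For anyone who has to stay in base R, the same "extract everything, keep the last" idea can be sketched with `gregexpr` and `regmatches`. The helper name `last_match` and its default pattern are assumptions made for this example, not an existing function.

```r
# Sketch in base R; `last_match` is a hypothetical helper name.
last_match <- function(x, pattern = "[0-9]+") {
  hits <- regmatches(x, gregexpr(pattern, x))  # list: zero or more matches per string
  vapply(
    hits,
    function(h) if (length(h) > 0) h[length(h)] else NA_character_,
    character(1)                               # enforces exactly one string per input
  )
}

# With no match, tail() leaves a zero-length element instead of NA:
no_hits <- regmatches("a", gregexpr("[0-9]+", "a"))
sapply(no_hits, tail, 1)

last_match(c("a", "sdkjfn45sdjk54()ad"))
```

Using `vapply` with `character(1)` is a design choice here: it fails loudly if any element would not reduce to a single string, rather than silently returning a list.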

r2evans