-1

How can I create a function in R that locates the word position of the first number in a string?

For example:

string1 <- "Hello I'd like to extract where the first 1010 is in this string"
#desired_output for string1
9

string2 <- "80111 is in this string"
#desired_output for string2
1

string3 <- "extract where the first 97865 is in this string"
#desired_output for string3
5
Wiktor Stribiżew
Neal Barsch

6 Answers

5

I would just use grep and strsplit here for a base R option:

sapply(input, function(x) grep("\\d+", strsplit(x, " ")[[1]]))

Hello I'd like to extract where the first 1010 is in this string
                                                               9
                                         80111 is in this string
                                                               1
                 extract where the first 97865 is in this string
                                                               5

Data:

input <- c("Hello I'd like to extract where the first 1010 is in this string",
           "80111 is in this string",
           "extract where the first 97865 is in this string")
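
If a string contains more than one numeric word, grep() returns every matching position, and if it contains none, it returns an empty vector. A minimal sketch of a wrapper that always yields one value per string follows; the name first_number_word and the extra test strings are hypothetical, not part of the original answer.

```r
# Hypothetical helper: word position of the first word containing a digit.
first_number_word <- function(strings) {
  vapply(strsplit(strings, " ", fixed = TRUE), function(words) {
    # grep() returns every matching index; [1] keeps only the first
    # and yields NA when nothing matched.
    grep("[0-9]", words)[1]
  }, integer(1))
}

first_number_word(c("80111 is in this string",
                    "two numbers here 12 and 34",
                    "no digits at all"))
# expected: 1 4 NA
```

vapply() is used instead of sapply() so that the result is always an integer vector, whatever the input looks like.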
Tim Biegeleisen
4

Here is a way to return your desired output:

library(stringr)
min(which(!is.na(suppressWarnings(as.numeric(str_split(string, " ", simplify = TRUE))))))

This is how it works:

str_split(string, " ", simplify = TRUE) # splits the string at each space; with simplify = TRUE the result is a one-row character matrix

as.numeric(...) # tries to convert each element to a number, returning NA when it fails

suppressWarnings(...) # suppresses the warnings generated by as.numeric

!is.na(...) # returns TRUE for the values that are not NA (i.e. the numbers)

which(...) # returns the positions of the TRUE values

min(...) # returns the first position

The output:

min(which(!is.na(suppressWarnings(as.numeric(str_split(string1, " ", simplify = TRUE))))))
[1] 9
min(which(!is.na(suppressWarnings(as.numeric(str_split(string2, " ", simplify = TRUE))))))
[1] 1
min(which(!is.na(suppressWarnings(as.numeric(str_split(string3, " ", simplify = TRUE))))))
[1] 5
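
The steps described above can be followed one at a time. This is a sketch in base R, with strsplit() standing in for str_split() so that no package is needed; the variable names are made up for illustration. The last line shows an assumed edge case: as.numeric() only accepts words that are entirely numeric.

```r
string1 <- "Hello I'd like to extract where the first 1010 is in this string"

words  <- strsplit(string1, " ", fixed = TRUE)[[1]]  # base R stand-in for str_split()
nums   <- suppressWarnings(as.numeric(words))        # NA for every non-numeric word
is_num <- !is.na(nums)                               # TRUE only where conversion worked

which(is_num)        # positions of all fully numeric words
min(which(is_num))   # first such position: 9

# Assumed edge case: punctuation attached to the number defeats as.numeric()
suppressWarnings(as.numeric("1010,"))  # NA, so "1010," would not be counted
```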
Ben Norris
1

Here is another approach. We can trim off everything after the first digit of the first number, then count the words that remain; that count is the word position of the number. \\b matches word boundaries while \\S+ matches one or more non-whitespace characters.

first_numeric_word <- function(x) {
  x <- substr(x, 1L, regexpr("\\b\\d+\\b", x))
  lengths(gregexpr("\\b\\S+\\b", x))
}

Output

> first_numeric_word(x)
[1] 9 1 5

Data

x <- c(
  "Hello I'd like to extract where  the first 1010 is in this string", 
  "80111 is in this string", 
  "extract where the   first  97865 is in this string"
)
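
To see what the two steps of first_numeric_word() do, here is a sketch that prints the intermediate values for one of the strings. The variable names are hypothetical, and perl = TRUE is added here as an assumption so that \\b is handled by the PCRE engine; the function above uses R's default engine.

```r
x <- "extract where the   first  97865 is in this string"

# Character index of the first digit of the first standalone number
start <- regexpr("\\b\\d+\\b", x, perl = TRUE)

# Keep everything up to and including that first digit
head_part <- substr(x, 1L, start)
head_part  # "extract where the   first  9"

# Count the words in the truncated string; the last one is the number
n_words <- lengths(gregexpr("\\b\\S+\\b", head_part, perl = TRUE))
n_words    # 5
```

Because words are counted with \\S+ rather than by splitting on a single space, repeated spaces do not shift the result.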
ekoam
1

Here is a fully tidyverse approach:

library(purrr)
library(stringr)

map_dbl(str_split(strings, " "), str_which, "\\d+")
#> [1] 9 1 5

map_dbl(str_split(strings[1], " "), str_which, "\\d+")
#> [1] 9

Note that it works with a single string as well as with a vector of strings.


Where strings is:

strings <- c("Hello I'd like to extract where the first 1010 is in this string",
             "80111 is in this string",
             "extract where the first 97865 is in this string")
Edo
0

Try the following:

library(stringr)

position_first_number <- function(string) {
  min(which(str_detect(str_split(string, "\\s+", simplify = TRUE), "[0-9]+")))
}

With your example strings:

> string1 <- "Hello I'd like to extract where the first 1010 is in this string"
> position_first_number(string1)
[1] 9
 
> string2 <- "80111 is in this string"
> position_first_number(string2)
[1] 1
 
> string3 <- "extract where the first 97865 is in this string"
> position_first_number(string3)
[1] 5
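
One thing to be aware of with min(which(...)) is what happens when no word contains a digit. The sketch below uses base grepl() as a stand-in for str_detect(), and the input string is made up for illustration.

```r
words <- strsplit("no digits at all", "\\s+")[[1]]
hits  <- which(grepl("[0-9]", words))  # integer(0): nothing matched

suppressWarnings(min(hits))  # Inf; min() of an empty vector also warns
hits[1]                      # NA, which is usually easier to test for
```

Indexing with [1] instead of calling min() might be preferable if strings without numbers are possible.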
semaphorism
0

Here is a base R solution that uses rapply() with grep() to recurse through the result of strsplit(); it works with a vector of strings.

Note: swap " " and fixed = TRUE with "\\s+" and fixed = FALSE (the default) if you want to split the strings on any whitespace instead of a literal space.

rapply(strsplit(strings, " ", fixed = TRUE), function(x) grep("[0-9]+", x))
[1] 9 1 5

Data:

strings = c("Hello I'd like to extract where the first 1010 is in this string", 
            "80111 is in this string", "extract where the first 97865 is in this string")
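
The difference between the two split patterns mentioned in the note shows up as soon as a string contains consecutive spaces. A small sketch, using a made-up string s:

```r
s <- "first  42 here"   # two spaces before the number

strsplit(s, " ", fixed = TRUE)[[1]]
# "first" "" "42" "here"  -> the empty string counts as a word

pos_literal <- grep("[0-9]+", strsplit(s, " ", fixed = TRUE)[[1]])
pos_regex   <- grep("[0-9]+", strsplit(s, "\\s+")[[1]])

pos_literal  # 3
pos_regex    # 2
```

Neither pattern removes leading whitespace, so a string that starts with a space still produces an empty first element; trimws() could be applied first if that matters.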
Andrew
  • Hey @TimBiegeleisen, if I am honest, I am not completely sure what you mean. Can you clarify? – Andrew Nov 03 '20 at 03:08