
I have a dataset with a string column like the one below, and I am trying to extract the number from each string. The extraction already works for observations that contain a single period. I am having trouble with observations that contain two periods (a sentence-ending period and a decimal point). I am trying to replace the first period with a `|`, like below:

library(stringr)
words=data.frame(text=c('I need a number. It is the number 40.6',
                   'I bet youd like this number. Too bad but it is 52.3',
                   'This number is important. It is 1.6'))
words$new_text=str_replace(string = words$text,
            pattern = '.',
            replacement = '|')

words$new_text
#> [1] "| need a number. It is the number 40.6"             
#> [2] "| bet youd like this number. Too bad but it is 52.3"
#> [3] "|his number is important. It is 1.6"

The problem is that instead of the first `.` being replaced with `|`, as happens with other characters, the first character of each string is replaced with `|`. I expected behavior like this, where only the first occurrence of the pattern is replaced:

library(stringr)
words=data.frame(text=c('I need a number. It is the number 40.6',
                        'I bet youd like this number. Too bad but it is 52.3',
                        'This number is important. It is 1.6'))
words$new_text2=str_replace(string = words$text,
                           pattern = 'n',
                           replacement = '|')
words$new_text2
#> [1] "I |eed a number. It is the number 40.6"             
#> [2] "I bet youd like this |umber. Too bad but it is 52.3"
#> [3] "This |umber is important. It is 1.6"

EDIT: "... trying to extract the number ...", not "... trying to extract the second number ..."

coconn41
  • By *second number*, do you mean the decimal portion of the floating point numbers? For instance, are you expecting `c(6, 3, 6)` or `c(40.6, 52.3, 1.6)`? – r2evans Feb 19 '21 at 18:02
  • BTW, the `.` matches *any character* in regexes. To match the literal, use `\\.` in R (or `\.` in most other languages). (For reference, https://stackoverflow.com/a/22944075/3358272) – r2evans Feb 19 '21 at 18:03
  • Whoops, my question was unclear. See edit. – coconn41 Feb 19 '21 at 18:06
  • Yeah, @DaveArmstrong's `stringr::str_extract` is what you need, two notes: (1) it is returning `character`, you may want `as.numeric` on that return; (2) if there are multiple numbers, it will silently ignore the second and beyond (perhaps not a problem in your case). – r2evans Feb 19 '21 at 18:17

1 Answer

You could escape the period so that it matches a literal `.` rather than any character. `str_replace` then replaces only the first period:

library(stringr)
words=data.frame(text=c('I need a number. It is the number 40.6',
                        'I bet youd like this number. Too bad but it is 52.3',
                        'This number is important. It is 1.6'))
words$new_text=str_replace(string = words$text,
                           pattern = '\\.',
                           replacement = '|')
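
As a small sketch of why the escape matters: in a regular expression an unescaped `.` matches any single character, so the first character of each string gets replaced. Escaping it as `\\.` or wrapping the pattern in stringr's `fixed()` should both make it match a literal period. The vector `x` below is illustrative data taken from the question, and the variable names are made up for this example.

```r
library(stringr)

# Illustrative data (two of the strings from the question)
x <- c("I need a number. It is the number 40.6",
       "This number is important. It is 1.6")

# Unescaped '.' is a regex wildcard: the first character is replaced
wildcard <- str_replace(x, ".", "|")

# Escaped period: the first literal '.' is replaced
escaped <- str_replace(x, "\\.", "|")

# fixed() treats the pattern as a plain string, so no escaping is needed
literal <- str_replace(x, fixed("."), "|")

# Base R alternative, for when stringr is not available
base_r <- sub(".", "|", x, fixed = TRUE)

escaped
#> [1] "I need a number| It is the number 40.6"
#> [2] "This number is important| It is 1.6"
```

The last three approaches give the same result here; `fixed()` is mostly a convenience when the pattern contains several characters that would otherwise need escaping.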

Or, if all you really want is the number at the end of each string, you could extract it directly with:

words$number=str_extract(string = words$text,
                           pattern = '\\d+\\.\\d*$')
words %>% dplyr::select(new_text, number)
#                                              new_text number
# 1              I need a number| It is the number 40.6   40.6
# 2 I bet youd like this number| Too bad but it is 52.3   52.3
# 3                 This number is important| It is 1.6    1.6
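
Following up on the two notes in the comments, here is a hedged sketch of converting the extracted text to numeric and of handling strings with more than one number. The second example string and the unanchored pattern `num_pattern` are assumptions made for illustration; they are not part of the original data.

```r
library(stringr)

# Illustrative data: the second string is hypothetical and holds two numbers
text <- c("I need a number. It is the number 40.6",
          "Values 1.5 and 2.25 both appear here")

# Unanchored pattern (an assumption): digits with an optional decimal part
num_pattern <- "\\d+(\\.\\d+)?"

# str_extract() returns character and only the first match per string,
# so convert it explicitly if numeric values are needed
first_num <- as.numeric(str_extract(text, num_pattern))

# str_extract_all() returns a list holding every match for each string
all_nums <- lapply(str_extract_all(text, num_pattern), as.numeric)

first_num
#> [1] 40.6  1.5
```

Whether the end anchor `$` from the answer or an unanchored pattern is more appropriate depends on where the number sits in the real data.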

DaveArmstrong