1

I need to extract numbers prior to their respective units from a string. Unfortunately the inputs sometimes vary and this is giving me trouble.

Sample data:

df <- data.frame(id = c(1, 2, 3, 4),
                 targets = c("1800 kcal 75 g", "2000kcal 80g", "1900 kcal,87g", "2035kcal,80g"))

> df
  id        targets
1  1 1800 kcal 75 g
2  2   2000kcal 80g
3  3  1900 kcal,87g
4  4   2035kcal,80g

Desired output:

df <- data.frame(id = c(1, 2, 3, 4),
                 targets = c("1800 kcal 75 g", "2000kcal 80g", "1900 kcal,87g", "2035kcal,80g"),
                 kcal_target = c("1800", "2000", "1900", "2035"),
                 protein_target = c("75", "80", "87", "80"))

> df
  id        targets kcal_target protein_target
1  1 1800 kcal 75 g        1800             75
2  2   2000kcal 80g        2000             80
3  3  1900 kcal,87g        1900             87
4  4   2035kcal,80g        2035             80

I got as far as this but it is breaking down with spaces between the numbers and unit keyword and a comma after the number keyword.

df <- df %>%
  mutate(calorie_target = str_extract_all(targets, regex("\\d+(?=kcal)|\\d+(?=kcal,)"))) %>%
  mutate(protein_target = str_extract_all(targets,  regex("\\d+(?=g)")))

> df
  id        targets calorie_target protein_target
1  1 1800 kcal 75 g                              
2  2   2000kcal 80g           2000             80
3  3  1900 kcal,87g                            87
4  4   2035kcal,80g           2035             80

edit: removed portion of code I'm not trying to capture

nlp
  • 81
  • 4

2 Answers2

3

Here is a data.table option using regmatches + transpose

setDT(df)[, setNames(transpose(regmatches(targets, gregexpr("\\d+", targets))), c("kcal_target", "protein_target")), id]

which gives

   id kcal_target protein_target
1:  1        1800             75
2:  2        2000             80
3:  3        1900             87
4:  4        2035             80
ThomasIsCoding
  • 53,240
  • 4
  • 13
  • 45
3

Base R with strcapture:

strcapture("(\\d+)\\D+(\\d+)", df$targets, list(calorie=0L, protein=0L))
#   calorie protein
# 1    1800      75
# 2    2000      80
# 3    1900      87
# 4    2035      80

You can cbind this to the original:

cbind(df, strcapture("(\\d+)\\D+(\\d+)", df$targets, list(calorie=0L, protein=0L)))
#   id        targets calorie protein
# 1  1 1800 kcal 75 g    1800      75
# 2  2   2000kcal 80g    2000      80
# 3  3  1900 kcal,87g    1900      87
# 4  4   2035kcal,80g    2035      80

If you wanted to put this in a dplyr pipe, then

library(dplyr)
df %>%
  bind_cols(strcapture("(\\d+)\\D+(\\d+)", .$targets, list(calorie=0L, protein=0L)))
#   id        targets calorie protein
# 1  1 1800 kcal 75 g    1800      75
# 2  2   2000kcal 80g    2000      80
# 3  3  1900 kcal,87g    1900      87
# 4  4   2035kcal,80g    2035      80

Note that strcapture uses regexec and regmatches under the hood, so this is similar to @ThomasIsCoding's answer in that respect.

For the regex,

A good reference if you need it is https://stackoverflow.com/a/22944075/3358272.

r2evans
  • 77,184
  • 4
  • 55
  • 96