
I'm stuck on a problem. Say we have a one-column data frame, dfCHEM:

CHEM_NAME
Aspirin
Captopril
(...)

I want to create a second column from the strings in the first, using webchem::get_cid():

CHEM_NAME    CID
Aspirin      2244
Captopril    44093
(...)

I tried this code, which doesn't work:

dfCHEM %>%
    mutate(CID=get_cid(CHEM_NAME)[[1]])

I'm convinced the problem is that get_cid(), when called inside mutate(), doesn't receive the CHEM_NAME string for the corresponding row, but I don't know how to fix this efficiently.
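A minimal sketch of the likely cause, using base R so it runs without dplyr or network access. `fake_get_cid()` is a hypothetical stand-in, not the real webchem function; it assumes the lookup returns a named list with one element per query string, which is what the `[[1]]` in the code above suggests. Under that assumption the function receives the whole column at once, and `[[1]]` keeps only the result for the first name.

```r
# Hypothetical stand-in for webchem::get_cid(); no network access needed.
# Assumption: it returns a named list with one element (a vector of CIDs)
# per query string.
fake_get_cid <- function(query) {
  lookup <- list(Aspirin = 2244L, Captopril = 44093L)
  lookup[query]
}

dfCHEM <- data.frame(CHEM_NAME = c("Aspirin", "Captopril"),
                     stringsAsFactors = FALSE)

# Called on the whole column, [[1]] keeps only the first query's result,
# so a single value would be recycled into every row.
whole_column <- fake_get_cid(dfCHEM$CHEM_NAME)[[1]]

# Taking the first element of each list entry gives one CID per row.
per_row <- vapply(fake_get_cid(dfCHEM$CHEM_NAME),
                  function(x) x[[1]], integer(1))
dfCHEM$CID <- unname(per_row)
```

Here `whole_column` is a single value, while `dfCHEM$CID` has one entry per chemical name.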

TiFr3D

1 Answer


You can add rowwise() to your code so that the operation is applied to each row separately.

library(dplyr)
library(webchem)

dfCHEM %>%
  rowwise() %>%
  mutate(CID = get_cid(CHEM_NAME)[[1]]) %>%
  ungroup()

# # A tibble: 2 x 2
#   CHEM_NAME   CID
#       <chr> <int>
# 1   Aspirin  2244
# 2 Captopril 44093

Or use lapply() to call get_cid() once per name, then unlist() the result.

dfCHEM %>%
  mutate(CID = unlist(lapply(CHEM_NAME, get_cid)))

#   CHEM_NAME   CID
# 1   Aspirin  2244
# 2 Captopril 44093

DATA

dfCHEM <- read.table(text = "CHEM_NAME
Aspirin
Captopril",
                     header = TRUE, stringsAsFactors = FALSE)
www
  • `dfCHEM %>% mutate(CID = sapply(get_cid(CHEM_NAME), \`[[\`, 1))` works without rowwise, so there's only one call to get_cid. I guess `unlist`-`lapply` is unsafe if get_cid might return a vector for each element. Alternatively, just use `first=TRUE` from the docs... – Frank Dec 01 '17 at 19:07
  • @Frank Good suggestion. If more than one CID is returned per name, the safest way is probably just `dfCHEM %>% mutate(CID=get_cid(CHEM_NAME))`, which stores the results as a list column in the data frame for further analysis. – www Dec 01 '17 at 19:14
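The list-column idea from the last comment could be sketched roughly as follows, again in base R so it runs offline. `fake_lookup()`, the compound names and the ID values are all made-up placeholders, not real PubChem data; the only assumption is that a lookup may return several IDs for one name.

```r
# Hypothetical stand-in for a lookup that can return several IDs per name.
# Names and IDs are made-up placeholders, not real PubChem data.
fake_lookup <- function(query) {
  table <- list(compound_a = c(101L, 102L), compound_b = 201L)
  table[query]
}

df <- data.frame(NAME = c("compound_a", "compound_b"),
                 stringsAsFactors = FALSE)

# Keep every hit by storing the results as a list column.
df$CID <- unname(fake_lookup(df$NAME))

# Derive summary columns later, without having discarded anything.
df$n_hits    <- lengths(df$CID)
df$first_CID <- vapply(df$CID, function(x) x[[1]], integer(1))
```

Keeping the full results means the choice of which ID to use can be made later, which is harder once `unlist()` has flattened a result with an unexpected number of elements.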