
I have a csv file with categorical and numerical data. I want to read in the csv file as a data frame, but I want to convert certain categorical variables to factors, and I want to transform the data of certain numerical variables with a log10 transformation.

I know that the relevant functions are read.csv() (automatically reads data in as a data frame), factor(), and log10(), but I've been unable to find a way to do this. How is this done?

halfer
The Pointer
  • Please provide a minimal reproducible example. – Christoph Aug 10 '20 at 09:20
  • You can transform any column of a data.frame by applying a function to it. So, if you want to change the class of variable `state` in data.frame `data` to factor, just write `data$state = as.factor(data$state)`. Similarly, you may do any arithmetic computations: `data$cellcount = log(data$cellcount, 10)` – Martin Wettstein Aug 10 '20 at 09:22

2 Answers


Read the data into R with read.csv():

df <- read.csv('/path/of/file.csv')

Let's assume your df looks something like this:

set.seed(123)
df <- data.frame(a = runif(5), b = letters[sample(5)], 
                 c = letters[sample(5)], d = runif(5), e = 1:5)

Create one vector of column names for each conversion you want to apply:

factor_cols <- c('b', 'c')
log_cols <- c('a', 'd')

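Now apply the functions to those columns. With dplyr, wrap the name vectors in all_of(); passing them bare still works but triggers an "external vector in selections is ambiguous" message:

library(dplyr)
new_df <- df %>%
  mutate(across(all_of(factor_cols), factor),
         across(all_of(log_cols), log10))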

Or in base R:

df[factor_cols] <- lapply(df[factor_cols], factor)
df[log_cols] <- lapply(df[log_cols], log10)
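
For completeness, here is a minimal sketch of the whole round trip in base R, starting from an actual file. The file contents and the column names `species`, `site` and `count` are made up for illustration. It also shows that read.csv() can create the factors at read time through its `colClasses` argument; the log transform still has to be applied after reading.

```r
# Write a small, made-up CSV to a temporary file so the example is runnable
csv_path <- tempfile(fileext = ".csv")
writeLines(c("species,site,count",
             "a,x,10",
             "b,y,100",
             "a,y,1000"), csv_path)

# Named colClasses: the listed columns are read as factors,
# the types of the remaining columns are guessed as usual
dat <- read.csv(csv_path, colClasses = c(species = "factor", site = "factor"))

# Apply the log transform to the numeric column after reading
log_cols <- c("count")
dat[log_cols] <- lapply(dat[log_cols], log10)

str(dat)
```

Keep in mind that log10() returns -Inf for zeros and NaN (with a warning) for negative values, so it is worth checking the range of a column before transforming it.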
Ronak Shah
  • Ahh, yes, I think `dplyr` is what I was looking for. However, I get the following message: – The Pointer Aug 10 '20 at 10:09
  • Note: Using an external vector in selections is ambiguous. ℹ Use `all_of(factor_cols)` instead of `factor_cols` to silence this message. ℹ See . This message is displayed once per session. Note: Using an external vector in selections is ambiguous. ℹ Use `all_of(log_cols)` instead of `log_cols` to silence this message. ℹ See . This message is displayed once per session. – The Pointer Aug 10 '20 at 10:09
  • Yes, that is a warning which is shown only once in each session, so if you run the code again it will not be shown. The safe/preferred way is to use `all_of`, like this: `across(all_of(factor_cols), factor)`, and the same for `log_cols`. – Ronak Shah Aug 10 '20 at 10:12
  • Yes, that works. Thank you for the help! – The Pointer Aug 10 '20 at 10:17

Here is a complete, working example using the Pokémon Stats data. We can automate the conversion of columns by obtaining the column types from the input data.

gen01file <- "https://raw.githubusercontent.com/lgreski/pokemonData/master/gen01.csv"

gen01 <- read.csv(gen01file,header=TRUE,stringsAsFactors = FALSE)

At this point, the gen01 data frame consists of some character columns, some integer columns, and a logical column.


Next, we'll extract the column types with a combination of lapply() and unlist().

# extract the column types
colTypes <- unlist(lapply(gen01[colnames(gen01)], typeof))

At this point, colTypes is a character vector of column types in which each element is named after its column. This matters because we can now extract those names and automate both steps: converting character variables to factors and applying a log10() transformation to integer and double variables.

# find character types to convert to factor, using element names from
# colTypes vector
factorColumns <- names(colTypes[colTypes == "character"])
logColumns <- names(colTypes[colTypes %in% c("integer","double")])

Note that at this point we could potentially subset the column name objects further (e.g. use regular expressions to pull certain names from the list of columns, given their data type).
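
As a rough sketch of that kind of subsetting, the example below keeps only the numeric columns whose names match a pattern, so that an identifier column is not log transformed. The data frame `toy`, its column names and the `stat_` prefix are hypothetical and not taken from the Pokémon file.

```r
# Hypothetical data frame standing in for the real data
toy <- data.frame(id = 1:3,
                  stat_hp = c(10, 100, 1000),
                  stat_speed = c(1, 10, 100),
                  name = c("a", "b", "c"),
                  stringsAsFactors = FALSE)

# Named vector of column types, as in the answer
colTypes <- unlist(lapply(toy, typeof))
numericNames <- names(colTypes[colTypes %in% c("integer", "double")])

# Keep only the numeric columns whose names start with "stat_";
# the integer column `id` is numeric but is left out of the transform
logColumns <- grep("^stat_", numericNames, value = TRUE)

toy[logColumns] <- lapply(toy[logColumns], log10)
```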

Finally, we use lapply() to apply the appropriate transform on the relevant columns, as noted in Ronak Shah's answer.

gen01[factorColumns] <- lapply(gen01[factorColumns],factor)
gen01[logColumns] <- lapply(gen01[logColumns],log10)

As we can see from the RStudio object viewer, the character variables are now factors, and the values of the integer columns have been log transformed. The Legendary logical column is untouched.
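
The same check can be done in the console without the object viewer. Below is a small sketch on a made-up data frame; the column names are illustrative and are not those of the Pokémon file.

```r
# Made-up data frame with one character, one integer and one logical column
toy <- data.frame(name = c("a", "b"),
                  hp = c(10L, 100L),
                  legendary = c(FALSE, TRUE),
                  stringsAsFactors = FALSE)

# Same type-driven selection as above
colTypes <- unlist(lapply(toy, typeof))
factorColumns <- names(colTypes[colTypes == "character"])
logColumns <- names(colTypes[colTypes %in% c("integer", "double")])

toy[factorColumns] <- lapply(toy[factorColumns], factor)
toy[logColumns] <- lapply(toy[logColumns], log10)

# One class per column after the conversion
classes <- sapply(toy, class)
print(classes)
```

Here `classes` should report a factor, a numeric and a logical column, confirming that the logical column was not touched by either transform.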


Len Greski