29

I'm working on a prediction problem and building a decision tree in R. I have several categorical variables and I'd like to one-hot encode them consistently in my training and testing sets. I managed to do it on my training data with:

temps <- X_train
tt <- subset(temps, select = -output)
oh <- data.frame(model.matrix(~ . -1, tt), CLASS = temps$output)

But I can't find a way to apply the same encoding to my testing set. How can I do that?

Esteban PS
xeco

5 Answers

40

I recommend using the dummyVars function in the caret package:

customers <- data.frame(
  id=c(10, 20, 30, 40, 50),
  gender=c('male', 'female', 'female', 'male', 'female'),
  mood=c('happy', 'sad', 'happy', 'sad','happy'),
  outcome=c(1, 1, 0, 0, 0))
customers
  id gender  mood outcome
1 10   male happy       1
2 20 female   sad       1
3 30 female happy       0
4 40   male   sad       0
5 50 female happy       0


# dummify the data
library(caret)
dmy <- dummyVars(" ~ .", data = customers)
trsf <- data.frame(predict(dmy, newdata = customers))
trsf
  id gender.female gender.male mood.happy mood.sad outcome
1 10             0           1          1        0       1
2 20             1           0          0        1       1
3 30             1           0          1        0       0
4 40             0           1          0        1       0
5 50             1           0          1        0       0

example source

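Fit dummyVars on the training set only, then pass that same object to predict() for both the training and the test (or validation) sets, so that both end up with identical columns.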

Esteban PS
  • I found that the caret approach (with dummyVars) is about 73% faster than the `one_hot()` function from the `mltools` package. Using the `microbenchmark` package and `iris` data set, the caret method finishes in 0.025 milliseconds, while the `one_hot()` method finishes in 0.095 milliseconds. – Dale Kube Dec 19 '18 at 00:56
  • @DaleKube have you included the `data.frame(predict(dmy, newdata = customers))` in your benchmark? Apparently `dummyVars` alone will not give you the actual dummies – robertspierre Apr 21 '19 at 17:00
  • If you have a dataframe with different variables, and you want to one-hot encode just some of them, you need to use something like `dummyVars(" ~ VARIABLE1 + VARIABLE2", data = customers)` – robertspierre Apr 21 '19 at 17:04
  • @raffamaiden yes, I included the predict() call and conversion to data.frame. – Dale Kube Apr 23 '19 at 01:29
  • Here's an alternative using recipes (tidymodels) package: https://blog.datascienceheroes.com/how-to-use-recipes-package-for-one-hot-encoding/ – Pablo Casas Jul 24 '19 at 14:54
  • An added bonus of the caret approach is that it removes any inner white space within factor levels when it makes new columns in the model matrix – nate Sep 11 '19 at 16:51
20

Code

library(data.table)
library(mltools)
customers_1h <- one_hot(as.data.table(customers))

Result

> customers_1h
   id gender_female gender_male mood_happy mood_sad outcome
1: 10             0           1          1        0       1
2: 20             1           0          0        1       1
3: 30             1           0          1        0       0
4: 40             0           1          0        1       0
5: 50             1           0          1        0       0

Data

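customers <- data.frame(
  id=c(10, 20, 30, 40, 50),
  gender=c('male', 'female', 'female', 'male', 'female'),
  mood=c('happy', 'sad', 'happy', 'sad','happy'),
  outcome=c(1, 1, 0, 0, 0),
  # one_hot() only encodes factor columns, and since R 4.0.0 data.frame()
  # no longer converts strings to factors by default
  stringsAsFactors=TRUE)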
Roman
20

Here's a simple solution to one-hot-encode your category using no packages.

Solution

model.matrix(~0+category)

It needs your categorical variable to be a factor. The factor levels must be the same in your training and test data; check with levels(train$category) and levels(test$category). It doesn't matter if some levels don't occur in your test set.
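As a minimal sketch of the point about levels, the snippet below uses hypothetical toy data (the `train` and `test` data frames and the `category` values are made up for illustration). It assumes the test column arrives as plain character strings and re-declares it as a factor with the training levels before encoding.

```r
# Hypothetical toy data: the test set never contains the "green" level.
train <- data.frame(category = factor(c("red", "green", "blue", "red")))
test  <- data.frame(category = c("blue", "red"), stringsAsFactors = FALSE)

# Re-declare the test column as a factor that carries the training levels.
test$category <- factor(test$category, levels = levels(train$category))

# Encode both sets with the same formula.
oh_train <- model.matrix(~ 0 + category, data = train)
oh_test  <- model.matrix(~ 0 + category, data = test)

# Both matrices have one column per training level, in the same order.
same_columns <- identical(colnames(oh_train), colnames(oh_test))
same_columns
```

The column for the level that is missing from the test set is still created and simply contains zeros. A value in the test set that is not among the training levels would become NA under this approach, so it is worth checking for that before encoding.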

Example

Here's an example using the iris dataset.

data(iris)
#Split into train and test sets.
train <- sample(1:nrow(iris),100)
test <- -1*train

iris[test,]

    Sepal.Length Sepal.Width Petal.Length Petal.Width   Species
34           5.5         4.2          1.4         0.2    setosa
106          7.6         3.0          6.6         2.1 virginica
112          6.4         2.7          5.3         1.9 virginica
127          6.2         2.8          4.8         1.8 virginica
132          7.9         3.8          6.4         2.0 virginica

model.matrix() creates a column for each level of the factor, even if it is not present in the data. Zero indicates it is not that level, one indicates it is. Adding the zero specifies that you do not want an intercept or reference level and is equivalent to -1.

oh_train <- model.matrix(~0+iris[train,'Species'])
oh_test <- model.matrix(~0+iris[test,'Species'])

#Renaming the columns to be more concise.
colnames(oh_test) <- levels(iris$Species)


  setosa versicolor virginica
1      1          0         0
2      0          0         1
3      0          0         1
4      0          0         1
5      0          0         1
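
The two claims made in this answer can be checked directly. The following is only a small sketch on the built-in iris data; the variable names (`a`, `b`, `only_virginica`, `oh_subset`, `same_encoding`) are illustrative and not part of the answer's own code.

```r
data(iris)

# 1. `~ 0 + x` and `~ x - 1` both drop the intercept and yield the same matrix.
a <- model.matrix(~ 0 + Species, data = iris)
b <- model.matrix(~ Species - 1, data = iris)
same_encoding <- identical(colnames(a), colnames(b)) && all(a == b)

# 2. A subset containing a single species still gets one column per level,
#    because the factor keeps all of its levels after subsetting.
only_virginica <- iris[iris$Species == "virginica", ]
oh_subset <- model.matrix(~ 0 + Species, data = only_virginica)

same_encoding
colnames(oh_subset)
```

Note that calling droplevels() on the subset before encoding would remove the unused levels and therefore the corresponding columns.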

P.S. It's generally preferable to include all categories in training and test data. But that's none of my business.

D A Wells
3

Here is my version of the same idea. This function encodes every categorical variable that is stored as a factor, drops one dummy column per variable to avoid the dummy variable trap, and returns a new data frame with the encoding:

onehotencoder <- function(df_orig) {
  # work on a copy so the caller's data frame is left untouched
  df <- df_orig
  # names of all factor columns, collected before any column is replaced
  factor_cols <- names(df)[sapply(df, is.factor)]
  for (clmnm in factor_cols) {
    clmn_obj <- df[clmnm]
    # one dummy column per level of this factor
    dummy_matx <- data.frame(model.matrix(~ . - 1, data = clmn_obj))
    # drop one dummy column (the second level) to avoid the dummy variable
    # trap; negative indexing also works when the factor has only two levels
    dummy_matx <- dummy_matx[, -2, drop = FALSE]
    # replace the original factor column with its dummy columns
    df[clmnm] <- NULL
    df <- cbind(df, dummy_matx)
  }
  return(df)
}
0

In case you don't want to use any external package, I have my own function. Note that it leaves out the first value of each variable as the reference level:

one_hot_encoding = function(df, columns="season"){
  # create a copy of the original data.frame for not modifying the original
  df = cbind(df)
  # convert the columns to vector in case it is a string
  columns = c(columns)
  # for each variable perform the One hot encoding
  for (column in columns){
    unique_values = sort(unique(df[column])[,column])
    non_reference_values  = unique_values[c(-1)] # the first element is going 
                                                 # to be the reference by default
    for (value in non_reference_values){
      # the new dummy column name
      new_col_name = paste0(column,'.',value)
      # create new dummy column for each value of the non_reference_values
      df[new_col_name] <- ifelse(df[,column] == value, 1, 0)
    }
    # delete the one hot encoded column
    df[column] = NULL

  }
  return(df)
}

And you use it like this:

df = one_hot_encoding(df, c("season"))
Ángel