-2

I've got the following data:

dataset <- structure(list(id = structure(c(2L, 3L, 1L, 3L, 1L, 9L), .Label = c("215101", 
"215559", "216566", "217284", "219435", "220209", "220249", "220250", 
"225678", "225679", "225687", "225869", "228420", "228435", "230621", 
"230623", "233063", "233097", "233098", "235546", "235560", "235567", 
"236379"), class = "factor"), cat1 = c("A", "B", "B", "A", "A", 
"A"), cat2 = c("item 1", "item 1", "item 2", "item 5", "item 3", 
"item 28"), cat3 = c("theme 2", "theme 2", "theme 1", "theme 4", 
"theme 10", "theme 40")), class = c("tbl_df", "tbl", "data.frame"
), row.names = c(NA, -6L))

I would like to create kind of model matrix with one hot encoded columns features created from columns cat2 and cat3. Therefore, my output would look like this:

structure(list(id = structure(c(1L, 1L, 2L, 3L, 3L, 9L), .Label = c("215101", 
"215559", "216566", "217284", "219435", "220209", "220249", "220250", 
"225678", "225679", "225687", "225869", "228420", "228435", "230621", 
"230623", "233063", "233097", "233098", "235546", "235560", "235567", 
"236379"), class = "factor"), cat1 = c("A", "B", "A", "A", "B", 
"A"), `item 1` = c(0, 0, 1, 0, 1, 0), `item 2` = c(0, 1, 0, 0, 
0, 0), `item 28` = c(0, 0, 0, 0, 0, 1), `item 3` = c(1, 0, 0, 
0, 0, 0), `item 5` = c(0, 0, 0, 1, 0, 0), `theme 1` = c(0, 1, 
0, 0, 0, 0), `theme 10` = c(1, 0, 0, 0, 0, 0), `theme 2` = c(0, 
0, 1, 0, 1, 0), `theme 4` = c(0, 0, 0, 1, 0, 0), `theme 40` = c(0, 
0, 0, 0, 0, 1)), class = c("tbl_df", "tbl", "data.frame"), row.names = c(NA, 
-6L))

However, I don't have my independent variable in this dataset and I would like to preserve id and cat1 columns. How can I do that?

jakes
  • 1,636
  • 9
  • 33
  • Just in case any of the two answers solved your problem please consider to [accept](https://meta.stackexchange.com/questions/5234/how-does-accepting-an-answer-work) one. – markus Dec 17 '18 at 11:33

1 Answers1

1

You could use merge and dcast twice.

library(reshape2)
merge(dcast(dataset, id + cat1 ~ cat2, fun.aggregate = length),
      dcast(dataset, id + cat1 ~ cat3, fun.aggregate = length),
      by = c("id", "cat1"))
#      id cat1 item 1 item 2 item 28 item 3 item 5 theme 1 theme 10 theme 2 theme 4 theme 40
#1 215101    A      0      0       0      1      0       0        1       0       0        0
#2 215101    B      0      1       0      0      0       1        0       0       0        0
#3 215559    A      1      0       0      0      0       0        0       1       0        0
#4 216566    A      0      0       0      0      1       0        0       0       1        0
#5 216566    B      1      0       0      0      0       0        0       1       0        0
#6 225678    A      0      0       1      0      0       0        0       0       0        1

If you have more then two variables to spread you might melt you data first. This will save you some typing.

dcast(melt(dataset, id.vars = c("id", "cat1")), id + cat1 ~ value, fun.aggregate = length)
markus
  • 23,189
  • 5
  • 29
  • 47