using a loop for creating multiple dummy variables

Question

I am trying to write a loop that creates dummy variable columns. My column mydata$code contains codes like "AA1" or "AA2". I want an individual column for each code, where the row's value is 1 if the row has that code and 0 otherwise. There are many codes, so I don't want to do it all by hand. How would I fix my code below to achieve this?

x <- c("AA1","AA2","AA3","AA4")
for(i in 1:x){
  mydata$code_[i] <- as.integer(str_detect(mydata$code,"i"))
}

current

    ID Code 
1 9343  AA1    
2 8333  AA1   
3 6449  AA3

desired

    ID Code AA1 AA2 AA3 AA4
1 9343  AA1   1   0   0   0
2 8333  AA1   1   0   0   0
3 6449  AA3   0   0   1   0

I didn't say you are, try it with the posted vector. Or with another small vector. — Rui Barradas, May 13 '21 at 05:45
Can you post expected output? There is also [this SO post](https://stackoverflow.com/questions/11952706/generate-a-dummy-variable/40343274) you could check out. — Rui Barradas, May 13 '21 at 05:47
Possible duplicate: https://stackoverflow.com/questions/48649443/how-to-one-hot-encode-several-categorical-variables-in-r or https://stackoverflow.com/questions/52539750/r-how-to-one-hot-encoding-a-single-column-while-keep-other-columns-still — MrFlick, May 13 '21 at 06:04

edsandorf · Answer 1 · 2021-05-13T06:36:45.763

You could loop over the elements of x and build each dummy column with ifelse():


x <- c("AA1", "AA2", "AA3", "AA4")


db <- data.frame(codes = sample(x, 10, TRUE))

db_new <- cbind(db, Reduce(cbind, lapply(x, function(i) ifelse(db$codes == i, 1, 0))))

If db is:

   codes
1    AA4
2    AA1
3    AA4
4    AA1
5    AA2
6    AA4
7    AA4
8    AA1
9    AA1
10   AA1

Then output becomes:

   codes init V2 V3 V4
1    AA4    0  0  0  1
2    AA1    1  0  0  0
3    AA4    0  0  0  1
4    AA1    1  0  0  0
5    AA2    0  1  0  0
6    AA4    0  0  0  1
7    AA4    0  0  0  1
8    AA1    1  0  0  0
9    AA1    1  0  0  0
10   AA1    1  0  0  0
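
The init, V2, V3, V4 headers above appear because the bound vectors carry no names. A minimal sketch of one way to get columns named after the codes, assuming base R only and a small fixed data frame instead of sample() so the result is reproducible:

```r
# Codes to turn into dummy columns (taken from the question)
x <- c("AA1", "AA2", "AA3", "AA4")

# Small fixed example data (illustrative values, not the asker's data)
db <- data.frame(codes = c("AA4", "AA1", "AA2", "AA1"))

# sapply() over a character vector names the result columns after its elements,
# so the matrix comes back with columns AA1, AA2, AA3, AA4
dummies <- sapply(x, function(i) as.integer(db$codes == i))

# Bind the named dummy columns to the original data
db_new <- cbind(db, dummies)
print(db_new)
```

Every code in x gets a column here, including AA3, which never occurs in this example and is therefore all zeros.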

EDIT:

It appears that your subscript is wrong. db$code[j] takes the jth element of the column code in db; it does not refer to a separate column for the jth code, so it will not work. You could try this:

Assuming that you are using the same codes for all columns and that they are given in x:

x <- c("AA1", "AA2", "AA3", "AA4")

Furthermore, assume that all your code columns are in your data.frame and that this is the only data in your data.frame.

db <- data.frame(codes_1 = sample(x, 10, TRUE),
                 codes_2 = sample(x, 10, TRUE))

Then we can use the fact that a data.frame is a list of columns, so it can be passed through lapply.

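db_list <- lapply(seq_along(db), function(i, x) {
  var <- db[[i]]                 # values of the i-th code column
  var_name <- colnames(db[i])    # its name, used as prefix for the dummy columns
  # One 0/1 column per code in x, bound to the original column
  db_tmp <- cbind(db[i], Reduce(cbind, lapply(x, function(j) ifelse(var == j, 1, 0))))
  colnames(db_tmp) <- c(var_name, paste(var_name, x, sep = "_"))
  return(db_tmp)
}, x)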

[[1]]
   codes_1 codes_1_AA1 codes_1_AA2 codes_1_AA3 codes_1_AA4
1      AA1           1           0           0           0
2      AA2           0           1           0           0
3      AA4           0           0           0           1
4      AA3           0           0           1           0
5      AA3           0           0           1           0
6      AA3           0           0           1           0
7      AA3           0           0           1           0
8      AA1           1           0           0           0
9      AA4           0           0           0           1
10     AA4           0           0           0           1

[[2]]
   codes_2 codes_2_AA1 codes_2_AA2 codes_2_AA3 codes_2_AA4
1      AA4           0           0           0           1
2      AA3           0           0           1           0
3      AA3           0           0           1           0
4      AA3           0           0           1           0
5      AA4           0           0           0           1
6      AA1           1           0           0           0
7      AA4           0           0           0           1
8      AA2           0           1           0           0
9      AA2           0           1           0           0
10     AA3           0           0           1           0

This gives you a list whose length is the number of columns you have, each element holding the original column and its dummy columns. If you want to get it all back into one data.frame, you can do this:

Reduce(cbind, db_list)
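
The same result can also be reached with an explicit for loop, which is closer to the attempt in the question. This is only a sketch under assumptions: the data below is made up for illustration (including one blank and one missing code), and %in% is used in place of == so that blank or missing codes produce 0 rather than NA.

```r
x <- c("AA1", "AA2", "AA3", "AA4")

# Hypothetical data: two code columns, with a blank and a missing value
db <- data.frame(
  codes_1 = c("AA1", "AA3", "", "AA4"),
  codes_2 = c("AA2", NA, "AA2", "AA1")
)

# Fix the list of code columns before the loop starts adding new ones
code_cols <- names(db)

for (col in code_cols) {
  for (code in x) {
    # [[ ]] with a constructed name creates (or overwrites) a whole column;
    # %in% returns FALSE, not NA, for missing values
    db[[paste(col, code, sep = "_")]] <- as.integer(db[[col]] %in% code)
  }
}

print(db)
```

Because the columns are added to db in place, no final binding step is needed.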

That seems to work fine. There are actually about 20 code fields to search from. They are `code_1,code_2,...,code_20`. Is there a way to loop through these? — user8261831, May 13 '21 at 06:03
That should work. Should then perhaps add a new column name to each before binding them to `db`. The easiest would be to append using the value of `j` or better the actual column name of the code field. — edsandorf, May 13 '21 at 06:09
`Error in data.frame(..., check.names = FALSE) : arguments imply differing number of rows: 19503, 0` — user8261831, May 13 '21 at 06:12
In your edit are the fields generic, I can't seem to get it running? — user8261831, May 13 '21 at 06:50
I just ran it in a clean R session and my edited example does run and should give the desired output. `db` needs to be a `data.frame`. — edsandorf, May 13 '21 at 07:06

score 2 · Answer 2 · answered May 13 '21 at 06:25

Or simply

library(tidyr)  # provides pivot_wider() and re-exports %>%
test_df %>%
  pivot_wider(names_from = code, values_fill = 0, values_fn = length, values_from = code)

# A tibble: 8 x 5
  ID      AA1   AA3   AA4   AA2
  <chr> <int> <int> <int> <int>
1 9348      1     0     0     0
2 8333      1     0     0     0
3 6449      0     1     0     0
4 8525      1     0     0     0
5 5306      0     0     1     0
6 1230      1     0     0     0
7 3039      0     0     0     1
8 2376      1     0     0     0

Thanks to @jared_mamrot for data

# data_frame() is deprecated; tibble() is its replacement
test_df <- tibble::tibble(ID = c("9348", "8333", "6449", "8525", "5306", "1230", "3039", "2376"),
                          code = c("AA1", "AA1", "AA3", "AA1", "AA4", "AA1", "AA2", "AA1"))

This can't handle blank codes – user8261831 May 13 '21 at 06:31
Means you also want a duplicate `code` column in output? – AnilGoyal May 13 '21 at 06:34

score 0 · Answer 3 · edited May 13 '21 at 11:04

Try tidyr::pivot_wider() and use your column for names_from argument.

If you want 0s instead of NAs in your result, use the argument values_fill = 0

You might need to add a variable of all 1s to your original df and pass it to the values_from argument, but I am not sure, as the question is incomplete.

Org_df %>%
  mutate(dummy_value = 1) %>%
  pivot_wider(id_cols = id, names_from = code, values_from = dummy_value, values_fill = 0)
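
A self-contained sketch of this idea, assuming dplyr and tidyr are installed. The data frame org_df and the helper column code_name are hypothetical names chosen for illustration. Copying code into code_name before pivoting is one way to keep the original code column in the output, as in the desired table in the question.

```r
library(dplyr)
library(tidyr)

# Hypothetical data mirroring the question's "current" table
org_df <- tibble(
  ID   = c(9343, 8333, 6449),
  code = c("AA1", "AA1", "AA3")
)

wide <- org_df %>%
  mutate(
    dummy_value = 1,     # the column of 1s that fills the dummy columns
    code_name   = code   # copy used for the new column names, so `code` survives
  ) %>%
  pivot_wider(
    names_from  = code_name,
    values_from = dummy_value,
    values_fill = 0      # 0 instead of NA where a row does not have the code
  )

print(wide)
```

Note that pivot_wider() only creates columns for codes that actually occur in the data, so AA2 and AA4 get no column in this example.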

answered May 13 '21 at 05:57 by Shaahin · edited May 13 '21 at 11:04 by James Mudd
This is basically what I did - nice answer :) – jared_mamrot May 13 '21 at 06:14
