2

I have a dataframe with 3 categorical variables (x,y,z) along with an ID column :

df <- frame_data(
  ~id, ~x, ~y, ~z,
  1, "a", "c" ,"v",
  1, "b", "d", "f",
  2, "a", "d", "v",
  2, "b", "d", "v")

I want to apply spread() to each of the categorical variables group by ID .

Output should be like this :

id  a  b  c  d  v  f
1  1  1  1  1  1  1
2  1  1  0  2  2  0

I tried doing it but I was able to do it only for one variable at once not all together .

For e.g: Applying spread only to the y column (similarly , it can be done for x and z separately) but not together in a single line

df %>% count(id,y) %>% spread(y,n,fill=0)
# A tibble: 2 x 3
id     c     d
<dbl> <int> <int>
1.00     1     1
2.00     0     2

Explaining my codes in three steps:

Step 1: count frequency

df %>% count(id,y)    
id     y         n
<dbl> <chr> <int>
1.00   c     1
1.00   d     1
2.00   d     2

Step 2 : applying spread()

df %>% count(id,y) %>% spread(y,n)
# A tibble: 2 x 3
id     c     d
<dbl> <int> <int>
1  1.00     1     1
2  2.00    NA     2

Step 3: Adding fill = 0 , replaces NA which means there was zero occurrence of c in y column for id 2 (as you can see in df)

df %>% count(id,y) %>% spread(y,n,fill=0)
# A tibble: 2 x 3
id     c     d
<dbl> <int> <int>
1.00     1     1
2.00     0     2

Problem : In my actual data set , I have 20 such categorical variables , I can't do it one by one for all. I am looking to do it all at once. Is it possible apply spread() in tidyr for all of categorical variables all together ? If not can you please suggest an alternative

Note: I also gave a try to these answers but were not helpful for this particular case:

Additional related helpful question :

It is possible that two categorical columns (Eg: Survey dataset) have same values . Like below.

df <- frame_data(
  ~id, ~Do_you_Watch_TV, ~Do_you_Drive, 
  1, "yes", "yes",
  1, "yes", "no",
  2, "yes", "no",
  2, "no", "yes")

# A tibble: 4 x 3
id Do_you_Watch_TV Do_you_Drive
<dbl> <chr>           <chr>       
  1  1.00 yes             yes         
2  1.00 yes             no          
3  2.00 yes             no          
4  2.00 no              yes 

Running the below code would not differentiate counts of yes and no for 'Do_you_Watch_TV', 'Do_you_Drive' :

df %>% gather(Key, value, -id) %>% 
  group_by(id, value) %>%
  summarise(count = n())  %>%
  spread(value, count, fill = 0) %>%
  as.data.frame()
id no yes
1  1   3
2  2   2

Whereas, expected output should be :
id Do_you_Watch_TV_no   Do_you_Watch_TV_yes  Do_you_Drive_no   Do_you_Drive_yes
1         0               2                    1                 1
2         1               1                    1                 1

So , we need to treat No and Yes from Do_you_Watch_TV and Do_you_Drive separately by adding prefix. Do_you_Drive_yes , Do_you_Drive_no , Do_you_Watch_TV _yes, Do_you_Watch_TV _no .

How can we achieve this?

Thanks

Jaap
  • 71,900
  • 30
  • 164
  • 175
Ravijeet
  • 73
  • 1
  • 11
  • Hi @Jaap : The question above has two parts. **The first part is answered partly by you here** : [link] https://stackoverflow.com/questions/43262085/using-group-by-create-aggregated-counts-conditional-on-value . Your code suggests to mention columns explicitly within gather : `tidyr :: spread library(dplyr) library(tidyr) dat %>% gather(key, val, state:type2) %>% group_by(serialno, val) %>% tally() %>% spread(val, n, fill = 0)` Whereas I am looking for doing it implicitly. So,Instead of `gather(Key, value, state:type2)` , we can use `gather(Key, value, -serialno)` – Ravijeet Jun 10 '18 at 10:28
  • @JAAP **The second part is : How do we do the same if we have an added condition - if two different category columns have same value ?** The answer to it is also not here [link] https://stackoverflow.com/questions/43262085/using-group-by-create-aggregated-counts-conditional-on-value . So, this question is possibly not a duplicate. Both parts have been answered below – Ravijeet Jun 10 '18 at 10:33
  • Thanks for adding second link , but does that cover using `tidyr::spread()` ? Can you add the link here if any ? The first link is partially asnwered . Second link : uses `dcast()` My question is related using `tidyr::spread()` as it performs better than `dcast()` (time of execution) . Kindly review my title and question asked – Ravijeet Jun 10 '18 at 11:04
  • (1) The first link answer the first part of your question fully. `gather(Key, value, state:type2)` is the same as `gather(Key, value, -serialno)`, [see here](https://stackoverflow.com/a/25856135/2204410). (2) The two additional links cover the second part of your question. – Jaap Jun 10 '18 at 13:00

1 Answers1

2

First you need to convert your data frame in long format before you can actually transform it in wide format. Hence, first you need to use tidyr::gather and convert data frame to long format. Afterwards, you have couple of options:

Option#1: Using tidyr::spread:

#data
df <- frame_data(
  ~id, ~x, ~y, ~z,
  1, "a", "c" ,"v",
  1, "b", "d", "f",
  2, "a", "d", "v",
  2, "b", "d", "v")

library(tidyverse)
df %>% gather(Key, value, -id) %>% 
  group_by(id, value) %>%
  summarise(count = n())  %>%
  spread(value, count, fill = 0) %>%
  as.data.frame()

#   id a b c d f v
# 1  1 1 1 1 1 1 1
# 2  2 1 1 0 2 0 2

Option#2: Another option can be is to use reshape2::dcast as:

library(tidyverse)
library(reshape2)

df %>% gather(Key, value, -id) %>% 
  dcast(id~value, fun.aggregate = length)

#   id a b c d f v
# 1  1 1 1 1 1 1 1
# 2  2 1 1 0 2 0 2

Edited: To include solution for 2nd data frame.

#Data
df1 <- frame_data(
  ~id, ~Do_you_Watch_TV, ~Do_you_Drive, 
  1, "yes", "yes",
  1, "yes", "no",
  2, "yes", "no",
  2, "no", "yes")

library(tidyverse)
df1 %>% gather(Key, value, -id) %>% unite("value", c(Key, value)) %>%
  group_by(id, value) %>%
  summarise(count = n())  %>%
  spread(value, count, fill = 0) %>%
  as.data.frame()

#   id Do_you_Drive_no Do_you_Drive_yes Do_you_Watch_TV_no Do_you_Watch_TV_yes
# 1  1               1                1                  0                   2
# 2  2               1                1                  1                   1
MKR
  • 18,633
  • 4
  • 20
  • 32
  • 2
    if not using `spread` is allowed, option#3: `df %>% gather(var, val, x:z) %>% {xtabs(~ id + val, data = .)}` and ... . – mt1022 Jun 10 '18 at 07:08
  • @mt1022 - option# 3 works but we need to specify column names explicitly like x:z . For implicit way of doing , I would prefer going with option#1 or #2 – Ravijeet Jun 10 '18 at 07:33
  • 1
    @Ravijeet, agreed. The key message is that we can achieve the desired output with many alternative ways that are simpler than `spread`-based approach. – mt1022 Jun 10 '18 at 07:38
  • @MKR - That works. But, I have one more related question : how can we add prefix (parent column i.e. x_) to a , b , c ? It becomes x_a ,x_b,x_c . I just realized there can be two categorical columns with same levels . Eg: x having levels 'Yes' and 'No' and y also having 'Yes' and 'No' . So , we need to treat them separately by adding prefix. x_yes , x_no , y_yes, y_no – Ravijeet Jun 10 '18 at 07:41
  • @Ravijeet You can create columns as `x_a`, `x_b` etc. You need to add `unite("value", c(Key, value)) %>%` before `group_by` in option#1. But I really dont understand what you mean by `Yes` and `No` part. It would be better to add and small example for expected output. – MKR Jun 10 '18 at 07:48
  • 1
    @MKR : Added the example and expected output in the question. `unite("value", c(Key, value))` - Works to take care of this related question too . Thanks – Ravijeet Jun 10 '18 at 08:26