I have a dataframe with 3 categorical variables (x
,y
,z
) along with an ID
column :
df <- frame_data(
~id, ~x, ~y, ~z,
1, "a", "c" ,"v",
1, "b", "d", "f",
2, "a", "d", "v",
2, "b", "d", "v")
I want to apply spread()
to each of the categorical variables group by ID
.
Output should be like this :
id a b c d v f
1 1 1 1 1 1 1
2 1 1 0 2 2 0
I tried doing it but I was able to do it only for one variable at once not all together .
For e.g: Applying spread only to the y
column (similarly , it can be done for x
and z
separately) but not together in a single line
df %>% count(id,y) %>% spread(y,n,fill=0)
# A tibble: 2 x 3
id c d
<dbl> <int> <int>
1.00 1 1
2.00 0 2
Explaining my codes in three steps:
Step 1: count frequency
df %>% count(id,y)
id y n
<dbl> <chr> <int>
1.00 c 1
1.00 d 1
2.00 d 2
Step 2 : applying spread()
df %>% count(id,y) %>% spread(y,n)
# A tibble: 2 x 3
id c d
<dbl> <int> <int>
1 1.00 1 1
2 2.00 NA 2
Step 3: Adding fill = 0
, replaces NA
which means there was zero occurrence of c
in y column for id 2
(as you can see in df)
df %>% count(id,y) %>% spread(y,n,fill=0)
# A tibble: 2 x 3
id c d
<dbl> <int> <int>
1.00 1 1
2.00 0 2
Problem : In my actual data set , I have 20 such categorical variables , I can't do it one by one for all. I am looking to do it all at once.
Is it possible apply spread()
in tidyr
for all of categorical variables all together ? If not can you please suggest an alternative
Note: I also gave a try to these answers but were not helpful for this particular case:
- R spreading multiple columns with tidyr
- Is it possible to use spread on multiple columns in tidyr similar to dcast?
- Can spread() in tidyr spread across multiple value?
- Expanding columns associated with a categorical variable into multiple columns with dplyr/tidyr while retaining id variable
Additional related helpful question :
It is possible that two categorical columns (Eg: Survey dataset) have same values . Like below.
df <- frame_data(
~id, ~Do_you_Watch_TV, ~Do_you_Drive,
1, "yes", "yes",
1, "yes", "no",
2, "yes", "no",
2, "no", "yes")
# A tibble: 4 x 3
id Do_you_Watch_TV Do_you_Drive
<dbl> <chr> <chr>
1 1.00 yes yes
2 1.00 yes no
3 2.00 yes no
4 2.00 no yes
Running the below code would not differentiate counts of yes and no for 'Do_you_Watch_TV', 'Do_you_Drive' :
df %>% gather(Key, value, -id) %>%
group_by(id, value) %>%
summarise(count = n()) %>%
spread(value, count, fill = 0) %>%
as.data.frame()
id no yes
1 1 3
2 2 2
Whereas, expected output should be :
id Do_you_Watch_TV_no Do_you_Watch_TV_yes Do_you_Drive_no Do_you_Drive_yes
1 0 2 1 1
2 1 1 1 1
So , we need to treat No and Yes from Do_you_Watch_TV and Do_you_Drive separately by adding prefix. Do_you_Drive_yes , Do_you_Drive_no , Do_you_Watch_TV _yes, Do_you_Watch_TV _no .
How can we achieve this?
Thanks