Using spread with duplicate identifiers for rows

Question

I have a long form dataframe that have multiple entries for same date and person.

jj <- data.frame(month=rep(1:3,4),
             student=rep(c("Amy", "Bob"), each=6),
             A=c(9, 7, 6, 8, 6, 9, 3, 2, 1, 5, 6, 5),
             B=c(6, 7, 8, 5, 6, 7, 5, 4, 6, 3, 1, 5))

I want to convert it to wide form and make it like this:

month Amy.A Bob.A Amy.B Bob.B
1     
2     
3
1
2
3
1
2
3
1
2
3

My question is very similar to this. I have used the given code in the answer :

kk <- jj %>% 
  gather(variable, value, -(month:student)) %>% 
  unite(temp, student, variable) %>% 
  spread(temp, value)

but it gives following error:

Error: Duplicate identifiers for rows (1, 4), (2, 5), (3, 6), (13, 16), (14, 17), (15, 18), (7, 10), (8, 11), (9, 12), (19, 22), (20, 23), (21, 24)

Thanks in advance. Note: I don't want to delete multiple entries.

The output doesn't make sense. How is Bob.B `5 6 7`. There are two Bob B values for month 1, `5 3`? And for month 2 it is `4 and 2`. Finally, for month 3 `6 5`. You are summarizing these to one value. — Pierre L, Aug 20 '16 at 11:20

score 29 · Answer 1 · answered Jun 13 '17 at 02:05

29

Your answer was missing mutate id! Here is the solution using dplyr packge only.

jj %>% 
  gather(variable, value, -(month:student)) %>% 
  unite(temp, student, variable) %>% 
  group_by(temp) %>% 
  mutate(id=1:n()) %>% 
  spread(temp, value) 
#  A tibble: 6 x 6
#  month    id Amy_A Amy_B Bob_A Bob_B
# * <int> <int> <dbl> <dbl> <dbl> <dbl>
# 1     1     1     9     6     3     5
# 2     1     4     8     5     5     3
# 3     2     2     7     7     2     4
# 4     2     5     6     6     6     1
# 5     3     3     6     8     1     6
# 6     3     6     9     7     5     5

answered Jun 13 '17 at 02:05

aliawadh980

448
5
5

2

If you don't want the *id* column, just add `%>% select(-id)` at the end. – bonna Jul 12 '17 at 11:47
3

This is a good trick. To expand briefly: if you don't have a unique id for each row before gather, there is no way to determine which values belong with which observations when spreading. Adding any columns that acts as a primary key mitigates this. – alexpghayes Sep 06 '18 at 16:43
1

This doesn't use `dplyr` only: `gather`, `unite`, and `spread` are all from `tidyr` – camille Feb 05 '20 at 20:24

Pierre L · Accepted Answer · 2016-08-20T12:07:08.387

The issue is the two columns for both A and B. If we can make that one value column, we can spread the data as you would like. Take a look at the output for jj_melt when you use the code below.

library(reshape2)
jj_melt <- melt(jj, id=c("month", "student"))
jj_spread <- dcast(jj_melt, month ~ student + variable, value.var="value", fun=sum)
#   month Amy_A Amy_B Bob_A Bob_B
# 1     1    17    11     8     8
# 2     2    13    13     8     5
# 3     3    15    15     6    11

I won't mark this as a duplicate since the other question did not summarize by sum, but the data.table answer could help with one additional argument, fun=sum:

library(data.table)
dcast(setDT(jj), month ~ student, value.var=c("A", "B"), fun=sum)
#    month A_sum_Amy A_sum_Bob B_sum_Amy B_sum_Bob
# 1:     1        17         8        11         8
# 2:     2        13         8        13         5
# 3:     3        15         6        15        11

If you would like to use the tidyr solution, combine it with dcast to summarize by sum.

as.data.frame(jj)
library(tidyr)
jj %>% 
  gather(variable, value, -(month:student)) %>%
  unite(temp, student, variable) %>%
  dcast(month ~ temp, fun=sum)
#   month Amy_A Amy_B Bob_A Bob_B
# 1     1    17    11     8     8
# 2     2    13    13     8     5
# 3     3    15    15     6    11

Edit

Based on your new requirements, I have added an activity column.

library(dplyr)
jj %>% group_by(month, student) %>% 
  mutate(id=1:n()) %>%
  melt(id=c("month", "id", "student")) %>%
  dcast(... ~ student + variable, value.var="value")
#   month id Amy_A Amy_B Bob_A Bob_B
# 1     1  1     9     6     3     5
# 2     1  2     8     5     5     3
# 3     2  1     7     7     2     4
# 4     2  2     6     6     6     1
# 5     3  1     6     8     1     6
# 6     3  2     9     7     5     5

The other solutions can also be used. Here I added an optional expression to arrange the final output by activity number:

library(tidyr)
jj %>% 
  gather(variable, value, -(month:student)) %>%
  unite(temp, student, variable) %>%
  group_by(temp) %>%
  mutate(id=1:n()) %>%
  dcast(... ~ temp) %>%
  arrange(id)
#   month id Amy_A Amy_B Bob_A Bob_B
# 1     1  1     9     6     3     5
# 2     2  2     7     7     2     4
# 3     3  3     6     8     1     6
# 4     1  4     8     5     5     3
# 5     2  5     6     6     6     1
# 6     3  6     9     7     5     5

The data.table syntax is compact because it allows for multiple value.var columns and will take care of the spread for us. We can then skip the melt -> cast process.

library(data.table)
setDT(jj)[, activityID := rowid(student)]
dcast(jj, ... ~ student, value.var=c("A", "B"))
#    month activityID A_Amy A_Bob B_Amy B_Bob
# 1:     1          1     9     3     6     5
# 2:     1          4     8     5     5     3
# 3:     2          2     7     2     7     4
# 4:     2          5     6     6     6     1
# 5:     3          3     6     1     8     6
# 6:     3          6     9     5     7     5

Thanks for answering. I do not want to sum. No arithmatical operation is needed. I want to create A and B columns for Amy, and A and B columns for Bob that simply have thier respective values. — Polar Bear, Aug 20 '16 at 11:35
If there are two values for the same month, student, and class, which one would you like to pick? — Pierre L, Aug 20 '16 at 11:36
I want both. Actually I am dealing with bid and ask data and hence there are multiple entries. — Polar Bear, Aug 20 '16 at 11:37
So you are not summarizing by the variables in question. You want a new variable that can act as an activity id. Also, do not use code snippets in your question. They do not work and the output is jumbled. Just highlight your code, paste, highlight and use Ctrl+K to indent into code readable format. — Pierre L, Aug 20 '16 at 11:40
May be a new activity id vairable do the job. I have no idea. It would be very kind of you if you can go through the linked question I posted in the question. My output should be same. The only difference is duplicate date rows for the same person with different values in all other columns. — Polar Bear, Aug 20 '16 at 11:46
Thank you very much. Would you please explain the tidyr code specially dcast(...~ temp) part? — Polar Bear, Aug 20 '16 at 12:21
We could have used `dcast(month + id ~ temp, value.var="value")`. We used a few tricks to shorten it. The three dots (all other columns) made it so that we would not have to write `month + id` and we did not have to write `value.var="value"` because the function will guess the value column by using the last column. — Pierre L, Aug 20 '16 at 12:28

score 2 · Answer 3 · answered Sep 13 '19 at 22:14

Since tidyr 1.0.0 pivot_wider is the recommended replacement of spread and you could do the following :

jj <- data.frame(month=rep(1:3,4),
                 student=rep(c("Amy", "Bob"), each=6),
                 A=c(9, 7, 6, 8, 6, 9, 3, 2, 1, 5, 6, 5),
                 B=c(6, 7, 8, 5, 6, 7, 5, 4, 6, 3, 1, 5))

library(tidyr)

pivot_wider(
  jj,
  names_from = "student",
  values_from = c("A","B"),
  names_sep = ".",
  values_fn = list(A= list, B= list)) %>%
  unchop(everything())
#> # A tibble: 6 x 5
#>   month A.Amy A.Bob B.Amy B.Bob
#>   <int> <dbl> <dbl> <dbl> <dbl>
#> 1     1     9     3     6     5
#> 2     1     8     5     5     3
#> 3     2     7     2     7     4
#> 4     2     6     6     6     1
#> 5     3     6     1     8     6
#> 6     3     9     5     7     5

^{Created on 2019-09-14 by the reprex package (v0.3.0)}

The twist in this problem is that month is not unique by student, to solve this :

values_fn = list(A= list, B= list)) puts the multiple values in a list
unchop(everything()) unnest the lists vertically, you can use unnest as well here

score 1 · Answer 4 · answered Sep 14 '19 at 22:49

If we create a unique sequence, then we can the output in the correct format with pivot_wider

library(dplyr)
library(tidyr)
jj %>%
   group_by(month, student) %>% 
   mutate(rn = row_number()) %>%
   pivot_wider(names_from = 'student', values_from = c('A', 'B'), 
          names_sep='.')  %>% 
   select(-rn)
# A tibble: 6 x 5
# Groups:   month [3]
#  month A.Amy A.Bob B.Amy B.Bob
#  <int> <dbl> <dbl> <dbl> <dbl>
#1     1     9     3     6     5
#2     2     7     2     7     4
#3     3     6     1     8     6
#4     1     8     5     5     3
#5     2     6     6     6     1
#6     3     9     5     7     5

data

jj <- structure(list(month = c(1L, 2L, 3L, 1L, 2L, 3L, 1L, 2L, 3L, 
1L, 2L, 3L), student = structure(c(1L, 1L, 1L, 1L, 1L, 1L, 2L, 
2L, 2L, 2L, 2L, 2L), .Label = c("Amy", "Bob"), class = "factor"), 
    A = c(9, 7, 6, 8, 6, 9, 3, 2, 1, 5, 6, 5), B = c(6, 7, 8, 
    5, 6, 7, 5, 4, 6, 3, 1, 5)), class = "data.frame", row.names = c(NA, 
-12L))

Using spread with duplicate identifiers for rows

4 Answers4

data

Linked

Related