1

I would appreciate any help to create new variables from one variable.

Specifically, I need help to simultaneously create one row per each ID and various columns of E, where each of the new columns of E, (that is, E1, E2, E3) contains the values of E for each row of ID. I tried doing this which melt followed by spread but I am getting the error:

Error: Duplicate identifiers for rows (4, 7, 9), (1, 3, 6), (2, 5, 8)

Additionally, I tried the solutions discussed here and here but these did not work for my case because I need to be able to create row identifiers for rows (4, 1, 2), (7, 3, 5), and (9, 6, 8). That is, E for rows (4, 1, 2) should be named E1, E for rows (7, 3, 5) should be named E2, E for rows (9, 6, 8) should be named E3, and so on.

#data

dT<-structure(list(A = c("a1", "a2", "a1", "a1", "a2", "a1", "a1", 
    "a2", "a1"), B = c("b2", "b2", "b2", "b1", "b2", "b2", "b1", 
    "b2", "b1"), ID = c("3", "4", "3", "1", "4", "3", "1", "4", "1"
    ), E = c(0.621142094943352, 0.742109450696123, 0.39439152996948, 
    0.40694392882818, 0.779607277916503, 0.550579323666347, 0.352622183880119, 
    0.690660491345867, 0.23378944873769)), class = c("data.table", 
    "data.frame"), row.names = c(NA, -9L))

#my attempt

    A  B ID         E
1: a1 b2  3 0.6211421
2: a2 b2  4 0.7421095
3: a1 b2  3 0.3943915
4: a1 b1  1 0.4069439
5: a2 b2  4 0.7796073
6: a1 b2  3 0.5505793
7: a1 b1  1 0.3526222
8: a2 b2  4 0.6906605
9: a1 b1  1 0.2337894

aTempDF <- melt(dT, id.vars = c("A", "B", "ID")) )

    A  B  ID variable    value
1: a1 b2  3        E 0.6211421
2: a2 b2  4        E 0.7421095
3: a1 b2  3        E 0.3943915
4: a1 b1  1        E 0.4069439
5: a2 b2  4        E 0.7796073
6: a1 b2  3        E 0.5505793
7: a1 b1  1        E 0.3526222
8: a2 b2  4        E 0.6906605
9: a1 b1  1        E 0.2337894

aTempDF%>%spread(variable, value)

Error: Duplicate identifiers for rows (4, 7, 9), (1, 3, 6), (2, 5, 8)

#expected output

    A  B  ID       E1           E2           E3
1: a1 b2  3        0.6211421    0.3943915    0.5505793
2: a2 b2  4        0.7421095    0.7796073    0.6906605 
3: a1 b1  1        0.4069439    0.3526222    0.2337894

Thanks in advance for any help.

Samuel Liew
  • 68,352
  • 105
  • 140
  • 225
Krantz
  • 1,223
  • 1
  • 8
  • 20
  • 1
    The original question has already been answered. Please create a new question instead of invalidating existing answers. – Samuel Liew Feb 01 '19 at 02:00

2 Answers2

1

You can use dcast from data.table

library(data.table)
dcast(dT, A + B + ID ~ paste0("E", rowid(ID)))
#   A  B ID        E1        E2        E3
#1 a1 b1  1 0.4069439 0.3526222 0.2337894
#2 a1 b2  3 0.6211421 0.3943915 0.5505793
#3 a2 b2  4 0.7421095 0.7796073 0.6906605

You need to create the correct 'time variable' first which is what rowid(ID) does.

markus
  • 23,189
  • 5
  • 29
  • 47
  • Hi @markus. Strangely, it seems that the solution works well for a small dataset (sample) but does not work for the actual dataset to which I want to use. I have edited and attached the bigger data. Thanks in advance for any help. – Krantz Feb 01 '19 at 00:17
1

For those looking for a tidyverse solution:

library(tidyverse)

dT <- structure(
  list(
    A = c("a1", "a2", "a1", "a1", "a2", "a1", "a1", "a2", "a1"),
    B = c("b2", "b2", "b2", "b1", "b2", "b2", "b1", "b2", "b1"),
    ID = c("3", "4", "3", "1", "4", "3", "1", "4", "1"),
    E = c(0.621142094943352, 0.742109450696123, 0.39439152996948, 0.40694392882818,
          0.550579323666347, 0.352622183880119, 0.690660491345867, 0.23378944873769,
          0.779607277916503)),
  class = c("data.table", 
            "data.frame"),
  row.names = c(NA, -9L))
dT %>% 
  as_tibble() %>%  # since dataset is a data.table object
  group_by(A, B, ID) %>% 
  # Just so columns are "E1", "E2", etc.
  mutate(rn = glue::glue("E{row_number()}")) %>% 
  ungroup() %>% 
  spread(rn, E) %>%
  # not necessary, just making output in the same order as your expected output
  arrange(desc(B)) 
# A tibble: 3 x 6
#  A     B     ID       E1    E2    E3
#  <chr> <chr> <chr> <dbl> <dbl> <dbl>
#1 a1    b2    3     0.621 0.394 0.551
#2 a2    b2    4     0.742 0.780 0.691
#3 a1    b1    1     0.407 0.353 0.234

As mentioned in the accepted answer, you need a "key" variable to spread on first. This is created using row_number() and glue where glue just gives you the proper E1, E2, etc. variable names.

The group_by piece just makes sure that the row numbers are with respect to A, B and ID.


EDIT for tidyr >= 1.0.0

The (not-so) new pivot_ functions supercede gather and spread and eliminate the need to glue the new variable names together in a mutate.

dT %>% 
  as_tibble() %>%  # since dataset is a data.table object
  group_by(A, B, ID) %>% 
  # no longer need to glue (or paste) the names together but still need a row number
  mutate(rn = row_number()) %>% 
  ungroup() %>% 
  pivot_wider(names_from = rn, values_from = E, names_glue = "E{.name}") %>% # names_glue argument allows for easy transforming of the new variable names
  # not necessary, just making output in the same order as your expected output
  arrange(desc(B)) 
# A tibble: 3 x 6
#  A     B     ID       E1    E2    E3
#  <chr> <chr> <chr> <dbl> <dbl> <dbl>
#1 a1    b2    3     0.621 0.394 0.551
#2 a2    b2    4     0.742 0.780 0.691
#3 a1    b1    1     0.407 0.353 0.234
Matt S
  • 15
  • 7