2

I have a data like the following and I would like to convert it into long format.

id count a1     b1 c1   a2     b2 c2  a3    b3  c3  age
1  1     apple  2  3    orange 3  2   beer   2   1   50
1  2     orange 3  2    apple  2  2   beer   2   1   50
2  1     pear   3  2    apple  2  2   orange 2   2   45

[a1,b1,c1],[a2,b2,c2],[a3,b3,c3] are the set of three attributes that person with an assigned id is facing and this person may face multiple choice situations with count indicating the ith choice situation. I want to change it back to a long format while keep the other variables like the following:

id count    a    b  c  age
1  1      apple  2  3  50
1  1      orange 3  2  50
1  1      beer   2  1  50
1  2      orange 3  2  50
1  2      apple  2  2  50
1  2      beer   2  1  50
2  1      pear   3  2  45
2  1      apple  2  2  45
2  1      orange 2  2  45

I have tried reshape with the following commands, but I get confused in terms of where to deal with timevar and times:

 l <- reshape(df, 
           varying = df[,3:11],
           v.names = c("a","b","c"),
           timevar = "choice", 
           times = c("a","b","c"), 
           direction = "long")

with the above commands, I cannot the result I want, would sincerely appreciate any help!

lll
  • 975
  • 1
  • 9
  • 23

3 Answers3

4

Use the melt function from data.table package:

library(data.table)
setDT(df)
melt(df, id.vars = c('id', 'count', 'age'),  
         measure = patterns('a\\d', 'b\\d', 'c\\d'), 
         # this needs to be regular expression to group `a1, a2, a3` etc together and 
         # the `\\d` is necessary because you have an age variable in the column.
         value.name = c('a', 'b', 'c'))[, variable := NULL][order(id, count, -age)]

#    id count age      a b c
# 1:  1     1  50  apple 2 3
# 2:  1     1  50 orange 3 2
# 3:  1     1  50   beer 2 1
# 4:  1     2  50 orange 3 2
# 5:  1     2  50  apple 2 2
# 6:  1     2  50   beer 2 1
# 7:  2     1  45   pear 3 2
# 8:  2     1  45  apple 2 2
# 9:  2     1  45 orange 2 2
Psidom
  • 171,477
  • 20
  • 249
  • 286
3

To use the reshape function, you just have to adjust the varying argument. It can be a list and you want to put the variables that will make up the same column together as vectors in a list:

reshape(df, 
        idvar=c("id", "count", "age"),
        varying = list(c(3,6,9), c(4,7,10), c(5,8,11)),
        timevar="time",
        v.names=c("a", "b", "c"), 
        direction = "long")

This returns

         id count age time      a b c
1.1.50.1  1     1  50    1  apple 2 3
1.2.50.1  1     2  50    1 orange 3 2
2.1.45.1  2     1  45    1   pear 3 2
1.1.50.2  1     1  50    2 orange 3 2
1.2.50.2  1     2  50    2  apple 2 2
2.1.45.2  2     1  45    2  apple 2 2
1.1.50.3  1     1  50    3   beer 2 1
1.2.50.3  1     2  50    3   beer 2 1
2.1.45.3  2     1  45    3 orange 2 2

I also added in the idvars as I think this is usually good practice for others or for re-reading your old code.

data

df <- read.table(header=T, text="id count a1     b1 c1   a2     b2 c2  a3    b3  c3  age
1  1     apple  2  3    orange 3  2   beer   2   1   50
1  2     orange 3  2    apple  2  2   beer   2   1   50
2  1     pear   3  2    apple  2  2   orange 2   2   45")
lmo
  • 35,764
  • 9
  • 49
  • 57
  • when i tried, it says Error in .subset(x, j) : invalid subscript type 'list' – lll Jul 19 '16 at 02:21
  • I just started a new R session and ran the code with the data that I just added without an error. – lmo Jul 19 '16 at 02:26
  • 1
    You don't need to manually specify the varying groups or the names, `reshape()` can guess it `reshape(dat, idvar=c("id","count","age"), direction="long", varying=3:11, sep="")` The `sep=""` just states that each `a/b/c` and `1/2/3` are separated by nothing `""`, allowing the `timevar` and the `v.names` to be inferred. – thelatemail Jul 19 '16 at 02:31
  • 1
    but i sometimes, if i used varying = xx:xx, I will get an error saying failed to guess time-varying variables from their names. can you tell me what is this error saying? – lll Jul 19 '16 at 02:37
  • 1
    @lll - `sep=` has to be specified if the variables are not in the form `grp.1 grp.2 othergrp.1 othergrp.2` etc as `sep=.` is the default. – thelatemail Jul 19 '16 at 02:39
  • thanks! Can you tell me what does sep = "" imply here – lll Jul 19 '16 at 02:54
  • Sure. Here is a quote from the help file: "A character vector of length 1, indicating a separating character in the variable names in the wide format. This is used for guessing v.names and times arguments based on the names in varying. If sep == "", the split is just before the first numeral that follows an alphabetic character." In this case, it says separate "a1" and "b1" into "a" "1" and "b" "1" during the reshape to long. the "a" and "b" are used as variable names while the "1"s are added to the timevar, which the code above names "time." – lmo Jul 19 '16 at 03:08
3

We can use dplyr/tidyr

library(dplyr)
library(tidyr)
gather(df1, Var, Val, a1:c3) %>%
       extract(Var, into = c("Var1", "Var2"), "(.)(.)") %>%
       spread(Var1, Val) %>%
       select(-Var2)
#   id count age      a b c
#1  1     1  50  apple 2 3
#2  1     1  50 orange 3 2
#3  1     1  50   beer 2 1
#4  1     2  50 orange 3 2
#5  1     2  50  apple 2 2
#6  1     2  50   beer 2 1
#7  2     1  45   pear 3 2
#8  2     1  45  apple 2 2
#9  2     1  45 orange 2 2
akrun
  • 674,427
  • 24
  • 381
  • 486