1

Apologies if this is a simple question, but I haven't been able to find a simple solution after searching. I'm fairly new to R, and am having trouble converting wide format to long format using either the melt (reshape2) or gather(tidyr) functions. The dataset that I'm working with contains 22 different time variables that are each 3 time periods. The problem occurs when I try to convert all of these from wide to long format at once. I have had success in converting them individually, but it's a very inefficient and long, so I was wondering if anyone could suggest a simpler solution. Below is a sample dataset I created that is formatted in a similar way as the dataset I am working with:

Subject <- c(1, 2, 3)
BlueTime1 <- c(2, 5, 6)
BlueTime2 <- c(4, 6, 7)
BlueTime3 <- c(1, 2, 3)
RedTime1 <- c(2, 5, 6)
RedTime2 <- c(4, 6, 7)
RedTime3 <- c(1, 2, 3)
GreenTime1 <- c(2, 5, 6)
GreenTime2 <- c(4, 6, 7)
GreenTime3 <- c(1, 2, 3)

sample.df <- data.frame(Subject, BlueTime1, BlueTime2, BlueTime3,
                    RedTime1, RedTime2, RedTime3,
                    GreenTime1,GreenTime2, GreenTime3)

A solution that has worked for me is to use the gather function from tidyr, arranging the data by Subject (so that each subject's data is grouped together), and then selecting only the subject, time period, and rating. This was done for each variable (in my case 22).

install.packages("dplyr")
install.packages("tidyr")
library(dplyr)
library(tidyr)

BlueGather <- gather(sample.df, Time_Blue, Rating_Blue, c(BlueTime1,
                                                          BlueTime2,
                                                          BlueTime3))
BlueSorted <- arrange(BlueGather, Subject)

BlueSubtracted <- select(BlueSorted, Subject, Time_Blue, Rating_Blue)

After this code, I combine everything into one data frame. This seems very slow and inefficient to me, and was hoping that someone could help me find a simpler solution. Thank you!

Eugene
  • 33
  • 2
  • 7

3 Answers3

2

We can use melt from data.table which can take multiple measure columns as a regex pattern

library(data.table)
melt(setDT(sample.df), measure = patterns("^Blue", "^Red", "^Green"), 
     value.name = c("BlueTime", "RedTime", "GreenTime"), variable.name = "time")
#   Subject time BlueTime RedTime GreenTime
#1:       1    1        2       2         2
#2:       2    1        5       5         5
#3:       3    1        6       6         6
#4:       1    2        4       4         4
#5:       2    2        6       6         6
#6:       3    2        7       7         7
#7:       1    3        1       1         1
#8:       2    3        2       2         2
#9:       3    3        3       3         3

Or as @StevenBeaupré mentioned in the comments, if there are many patterns, one option would be to use the names of the dataset after extracting the substring as the patterns argument

melt(setDT(sample.df), measure = patterns(as.list(unique(sub("\\d+", "", 
         names(sample.df)[-1])))),value.name = c("BlueTime", "RedTime", 
          "GreenTime"), variable.name = "time") 
akrun
  • 674,427
  • 24
  • 381
  • 486
  • 1
    OP says he has *22 different time variables that are each 3 time periods* I think we should provide a solution without explicitly refering to the colors. – Steven Beaupré Jul 21 '16 at 17:53
  • @StevenBeaupré This solution is based on the example he provided. As it has only 22 variables, the `patterns` can still take it.. – akrun Jul 21 '16 at 17:55
  • Yes I know and it's a very legitimate solution for the example provided +1 for that. But since he clearly mentions his real use case has many time variables, I think we should take that into consideration in providing OP with an answer that is going to be well suited to his real life application. – Steven Beaupré Jul 21 '16 at 17:56
  • @StevenBeaupré Thanks, honestly I didn't read the description, I was following the output in your solution. – akrun Jul 21 '16 at 17:57
  • Hehe no worries :) I'm thinking we could automate the construction of the pattern using the colnames of `df` maybe ? – Steven Beaupré Jul 21 '16 at 17:59
  • 1
    @StevenBeaupré Thanks for the idea. I updated it. – akrun Jul 21 '16 at 18:03
1

If your goal is to convert the three colors to long this can be accomplished with the base R reshape function:

reshape(sample.df, idvar="subject", varying=2:length(sample.df), sep="", direction="long")
    Subject time BlueTime RedTime GreenTime subject
1.1       1    1        2       2         2       1
2.1       2    1        5       5         5       2
3.1       3    1        6       6         6       3
1.2       1    2        4       4         4       1
2.2       2    2        6       6         6       2
3.2       3    2        7       7         7       3
1.3       1    3        1       1         1       1
2.3       2    3        2       2         2       2
3.3       3    3        3       3         3       3

The time variable captures the 1,2,3 in the names of the wide variables. The varying argument tells reshape which variables should be converted to long. The sep argument tells reshape to look for numbers at the end of the varying variables that are not separated by any characters, while the direction argument tells the function to attempt a long conversion.

I always add the id variable, even if it is not necessary for future reference.


If your data.frame doesn't have actually have the numbers for the time variable, a fairly simple solution is to change the variable names so that they do. For example, the following would replace "_Pre" with "1" at the end of any such variables.

names(df)[grep("_Pre$", names(df))] <- gsub("_Pre$", "1",
                                            names(df)[grep("_Pre$", names(df))])
lmo
  • 35,764
  • 9
  • 49
  • 57
  • Thanks for the quick response! I'm having a bit of trouble applying this to my real dataset. I'm getting the following error: ` code` Error in [ – Eugene Jul 21 '16 at 14:56
  • I'll post a suggested method to resolve this. – lmo Jul 21 '16 at 15:00
1

The idea here is to gather() all the time variables (all variables but Subject), use separate() on key to split them into a label and a time and then spread() the label and value to obtain your desired output.

library(dplyr)
library(tidyr)

sample.df %>%
  gather(key, value, -Subject) %>%
  separate(key, into = c("label", "time"), "(?<=[a-z])(?=[0-9])") %>%
  spread(label, value)

Which gives:

#  Subject time BlueTime GreenTime RedTime
#1       1    1        2         2       2
#2       1    2        4         4       4
#3       1    3        1         1       1
#4       2    1        5         5       5
#5       2    2        6         6       6
#6       2    3        2         2       2
#7       3    1        6         6       6
#8       3    2        7         7       7
#9       3    3        3         3       3

Note

Here we use the regex in separate() from this answer by @RichardScriven to split the column on the first encountered digit.


Edit

I understand from your comments that your dataset column names are actually in the form ColorTime_Pre, ColorTime_Post, ColorTime_Final. If that is the case, you don't have to specify a regex in separate() as the default one sep = "[^[:alnum:]]+" will match your _ and split the key into label and time accordingly:

sample.df %>%
  gather(key, value, -Subject) %>%
  separate(key, into = c("label", "time")) %>%
  spread(label, value)

Will give:

#  Subject  time BlueTime GreenTime RedTime
#1       1 Final        1         1       1
#2       1  Post        4         4       4
#3       1   Pre        2         2       2
#4       2 Final        2         2       2
#5       2  Post        6         6       6
#6       2   Pre        5         5       5
#7       3 Final        3         3       3
#8       3  Post        7         7       7
#9       3   Pre        6         6       6
Community
  • 1
  • 1
Steven Beaupré
  • 20,095
  • 7
  • 52
  • 73