1

Despite the conventions of R, data collection and entry is for me most easily done in vertical columns. Therefore, I have a question about efficiently converting to horizontal rows with the gather() function in the tidyverse library. I find myself using gather() over and over which seems inefficient. Is there a more efficient way? And can an existing vector serve as the key? Here is an example:

Let's say we have the following health metrics on baby birds.

    bird day_1_mass day_2_mass day_1_heart_rate day_3_heart_rate
1    1          5          6               60               55
2    2          6          8               62               57
3    3          3          3               45               45

Using the gather function I can reorganize the mass data into rows.

horizontal.data <- gather(vertical.data,
                      key = age, 
                      value = mass, 
                      day_1_mass:day_2_mass, 
                      factor_key=TRUE)

Giving us

    bird day_1_heart_rate day_3_heart_rate        age mass
1    1               60               55 day_1_mass    5
2    2               62               57 day_1_mass    6
3    3               45               45 day_1_mass    3
4    1               60               55 day_2_mass    6
5    2               62               57 day_2_mass    8
6    3               45               45 day_2_mass    3

And use the same function again to similarly reorganize heart rate data.

horizontal.data.2 <- gather(horizontal.data, 
                            key = age2, 
                            value = heart_rate, 
                            day_1_heart_rate:day_3_heart_rate, 
                            factor_key=TRUE)

Producing a new dataframe

     bird        age mass             age2 heart_rate
1     1 day_1_mass    5 day_1_heart_rate         60
2     2 day_1_mass    6 day_1_heart_rate         62
3     3 day_1_mass    3 day_1_heart_rate         45
4     1 day_2_mass    6 day_1_heart_rate         60
5     2 day_2_mass    8 day_1_heart_rate         62
6     3 day_2_mass    3 day_1_heart_rate         45
7     1 day_1_mass    5 day_3_heart_rate         55
8     2 day_1_mass    6 day_3_heart_rate         57
9     3 day_1_mass    3 day_3_heart_rate         45
10    1 day_2_mass    6 day_3_heart_rate         55
11    2 day_2_mass    8 day_3_heart_rate         57
12    3 day_2_mass    3 day_3_heart_rate         45

So it took two steps, but it worked. The questions are 1) Is there a way to do this in one step? and 2) Can it alternatively be done with one key (the "age" vector) that I can then simply replace as numeric data?

MasonK
  • 61
  • 5
  • Why don't you just use pipe function to combine both steps. I don't think there would be a way to include your desired operation within one `gather()` function >> `gather(vertical.data, key = age, value = mass, day_1_mass:day_2_mass, factor_key=TRUE) %>% gather(key = age2, value = heart_rate, day_1_heart_rate:day_3_heart_rate, factor_key=TRUE)` – M_Shimal Jun 22 '17 at 20:22
  • Possible duplicate of [Gather multiple sets of columns](https://stackoverflow.com/questions/25925556/gather-multiple-sets-of-columns) – aosmith Jun 22 '17 at 20:58

3 Answers3

1

if I get the question right, you could do that by first gathering everything together, and then "spreading" on mass and heart rate:

library(forcats)
library(dplyr)

mass_levs       <- names(vertical.data)[grep("mass", names(vertical.data))]
hearth_levs     <- names(vertical.data)[grep("heart", names(vertical.data))]
horizontal.data <- vertical.data %>% 
                     gather(variable, value, -bird,  factor_key = TRUE) %>% 
                     mutate(day = stringr::str_sub(variable, 5,5)) %>% 
                     mutate(variable = fct_collapse(variable,  
                                                    "mass" = mass_levs, 
                                                    "hearth_rate" = hearth_levs)) %>% 
                     spread(variable, value)

, giving:

  bird day mass hearth_rate
1    1   1    5          60
2    1   2    6          NA
3    1   3   NA          55
4    2   1    6          62
5    2   2    8          NA
6    2   3   NA          57
7    3   1    3          45
8    3   2    3          NA
9    3   3   NA          45

we can see how it works by going through the pipe one pass at a time. First, we gather everyting on a long format:

horizontal.data <- vertical.data %>% 
                     gather(variable, value, -bird,  factor_key = TRUE)

  bird         variable value
1     1       day_1_mass     5
2     2       day_1_mass     6
3     3       day_1_mass     3
4     1       day_2_mass     6
5     2       day_2_mass     8
6     3       day_2_mass     3
7     1 day_1_heart_rate    60
8     2 day_1_heart_rate    62
9     3 day_1_heart_rate    45
10    1 day_3_heart_rate    55
11    2 day_3_heart_rate    57
12    3 day_3_heart_rate    45

then, if we want to keep a "proper" long table, as the OP suggested we have to create a single key variable. In this case, it makes sense to use the day (= age). To create the day variable, we can extract it from the character strings now in variable:

%>% mutate(day = stringr::str_sub(variable, 5,5))

here, str_sub gets the substring in position 5, which is the day (note that if in the full dataset you have multiple-digits days, you'll have to tweak this a bit, probably by splitting on _):

   bird         variable value day
1     1       day_1_mass     5   1
2     2       day_1_mass     6   1
3     3       day_1_mass     3   1
4     1       day_2_mass     6   2
5     2       day_2_mass     8   2
6     3       day_2_mass     3   2
7     1 day_1_heart_rate    60   1
8     2 day_1_heart_rate    62   1
9     3 day_1_heart_rate    45   1
10    1 day_3_heart_rate    55   3
11    2 day_3_heart_rate    57   3
12    3 day_3_heart_rate    45   3

now, to finish we have to "spread " the table to have a mass and a heart rate column.

Here we have a problem, because currently there are 2 levels each corresponding to mass and hearth rate in the variable column. Therefore, applying spread on variable would give us again four columns.

To prevent that, we need to aggregate the four levels in variable into two levels. We can do that by using forcats::fc_collapse, by providing the association between the new level names and the "old" ones. Outside of a pipe, that would correspond to:

horizontal.data$variable <- fct_collapse(horizontal.data$variable, 
                                mass = c("day_1_mass", "day_2_mass",
                                heart = c("day_1_hearth_rate", "day_3_heart_rate")

However, if you have many levels it is cumbersome to write them all. Therefore, I find beforehand the level names corresponding to the two "categories" using

mass_levs       <- names(vertical.data)[grep("mass", names(vertical.data))]
hearth_levs     <- names(vertical.data)[grep("heart", names(vertical.data))]

mass_levs
[1] "day_1_mass" "day_2_mass"
hearth_levs
[1] "day_1_heart_rate" "day_3_heart_rate"

therefore, the third line of the pipe can be shortened to:

%>% mutate(variable = fct_collapse(variable,  
                              "mass" = mass_levs, 
                              "hearth_rate" = hearth_levs))

, after which we have:

  bird    variable value day
1     1        mass     5   1
2     2        mass     6   1
3     3        mass     3   1
4     1        mass     6   2
5     2        mass     8   2
6     3        mass     3   2
7     1 hearth_rate    60   1
8     2 hearth_rate    62   1
9     3 hearth_rate    45   1
10    1 hearth_rate    55   3
11    2 hearth_rate    57   3
12    3 hearth_rate    45   3

, so that we are now in the condition to "spread" the table again according to variable using:

%>% spread(variable, value)

  bird day mass hearth_rate
1    1   1    5          60
2    1   2    6          NA
3    1   3   NA          55
4    2   1    6          62
5    2   2    8          NA
6    2   3   NA          57
7    3   1    3          45
8    3   2    3          NA
9    3   3   NA          45

HTH

lbusett
  • 5,191
  • 2
  • 18
  • 43
  • Yeah that works beautifully and is an elegant script. It would serve to better "teach a man to fish" as it were if you could annotate a bit, specifically what you're doing within mutate with stringr(?), and fct_collapse. – MasonK Jun 22 '17 at 21:17
  • Ok. no it should be much more clear. Thanks for the suggestion – lbusett Jun 22 '17 at 21:58
  • Great explanation! – MasonK Jun 22 '17 at 23:06
0

If you insist on a single command , i can give you one

setup the data.frame

c1<-c(1,2,3)
c2<-c(5,6,3)
c3<-c(6,8,3)
c4<-c(60,62,45)
c5<-c(55,57,45)
dt<-as.data.table(cbind(c1,c2,c3,c4,c5))
colnames(dt)<-c("bird","day_1_mass","day_2_mass","day_1_heart_rate","day_3_heart_rate")

Now use this single command to get the final outcome

merge(melt(dt[,c("bird","day_1_mass","day_2_mass")],id.vars = c("bird"),variable.name = "age",value.name="mass"),melt(dt[,c("bird","day_1_heart_rate","day_3_heart_rate")],id.vars = c("bird"),variable.name = "age2",value.name="heart_rate"),by = "bird")

The final outcome is

     bird        age     mass             age2 heart_rate
 1:    1 day_1_mass         5 day_1_heart_rate         60
 2:    1 day_1_mass         5 day_3_heart_rate         55
 3:    1 day_2_mass         6 day_1_heart_rate         60
 4:    1 day_2_mass         6 day_3_heart_rate         55
 5:    2 day_1_mass         6 day_1_heart_rate         62
 6:    2 day_1_mass         6 day_3_heart_rate         57
 7:    2 day_2_mass         8 day_1_heart_rate         62
 8:    2 day_2_mass         8 day_3_heart_rate         57
 9:    3 day_1_mass         3 day_1_heart_rate         45
10:    3 day_1_mass         3 day_3_heart_rate         45
11:    3 day_2_mass         3 day_1_heart_rate         45
12:    3 day_2_mass         3 day_3_heart_rate         45
niths4u
  • 432
  • 2
  • 12
0

Though already answered, I have a different solution in which you save a list of the gather parameters you would like to run, and then run the gather_() command for each set of parameters in the list.

# Create a list of gather parameters
# Format is key, value, columns_to_gather
gather.list <- list(c("age", "mass", "day_1_mass", "day_2_mass"),
                    c("age2", "heart_rate", "day_1_heart_rate", "day_3_heart_rate"))

# Run gather command for each list item
for(i in gather.list){
  df <- gather_(df, key_col = i[1], value_col = i[2], gather_cols = c(i[3:length(i)]), factor_key = TRUE)
}
Matt Jewett
  • 2,857
  • 1
  • 10
  • 19