
I often find myself in a situation where I have a table that contains multiple groups of wide columns, like so:

  replicate groupA        VA1         VA2 groupB         VB1        VB2
1         1      a  0.3429166 -2.30336406      f  0.05363582  1.6454078
2         2      b -1.3183732 -0.13516849      g -0.42586417  0.1541541
3         3      c -0.7908358 -0.10746447      h  1.05134242  1.4297350
4         4      d -0.9963677 -1.82557058      i -1.14532536  1.0815733
5         5      e -1.3634609  0.04385812      j -0.65643595 -0.1452877

And I'd like to turn the columns into one long table, like so:

   replicate group key       value
1          1     a  V1  0.34291665
2          2     b  V1 -1.31837322
3          3     c  V1 -0.79083580
4          4     d  V1 -0.99636772
5          5     e  V1 -1.36346088
6          1     a  V2 -2.30336406
7          2     b  V2 -0.13516849
8          3     c  V2 -0.10746447
9          4     d  V2 -1.82557058
10         5     e  V2  0.04385812
11         1     f  V1  0.05363582
12         2     g  V1 -0.42586417
13         3     h  V1  1.05134242
14         4     i  V1 -1.14532536
15         5     j  V1 -0.65643595
16         1     f  V2  1.64540784
17         2     g  V2  0.15415408
18         3     h  V2  1.42973499
19         4     i  V2  1.08157329
20         5     j  V2 -0.14528774

I can do this by selecting the two groups of columns individually, tidying, and then rbinding together (code below). However, this approach doesn't seem particularly elegant, and it becomes cumbersome if there are more than two groups of columns. I'm wondering whether there's a more elegant approach, using a single pipe chain of data transformations.

The fundamental question here is: how do we automate the process of breaking the table into groups of columns, tidying those, and then combining them back together?

My current code:

library(dplyr)
library(tidyr)

# generate example data
df_wide <- data.frame(replicate = 1:5,
                      groupA = letters[1:5],
                      VA1 = rnorm(5),
                      VA2 = rnorm(5),
                      groupB = letters[6:10],
                      VB1 = rnorm(5),
                      VB2 = rnorm(5))

# tidy columns with A in the name
dfA <- select(df_wide, replicate, groupA, VA1, VA2) %>%
  gather(key, value, VA1, VA2) %>%
  mutate(key = case_when(key == "VA1" ~ "V1",
                         key == "VA2" ~ "V2")) %>%
  select(replicate, group = groupA, key, value)

# tidy columns with B in the name
dfB <- select(df_wide, replicate, groupB, VB1, VB2) %>%
  gather(key, value, VB1, VB2) %>%
  mutate(key = case_when(key == "VB1" ~ "V1",
                         key == "VB2" ~ "V2")) %>%
  select(replicate, group = groupB, key, value)

# combine
df_long <- rbind(dfA, dfB)

Note: Similar questions have been asked here and here, but I think the accepted answer shows that this here is a subtly different problem.

Claus Wilke

2 Answers


1.

Although the question asked for a tidyverse solution, there is a convenient option with melt from data.table, which can take multiple patterns in its measure argument.

library(data.table)
setnames(melt(melt(setDT(df1), measure = patterns('group', 'VA', 'VB')), 
        id.var = 1:3)[, -4, with = FALSE], 2:3, c('key', 'group'))[]
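
For readers who find the nested one-liner hard to follow, here is a step-wise sketch of the double-melt idea. It assumes df1 is the question's df_wide, and the patterns '^group', '1$', '2$' (my choice, not taken from the code above) pair each group column with its first and second value columns; this relies on the A columns preceding the B columns within each set.

dt <- as.data.table(df1)

# first melt: one row per replicate and block, with group, V1, V2 side by side
m1 <- melt(dt, measure = patterns("^group", "1$", "2$"),
           value.name = c("group", "V1", "V2"))
m1[, variable := NULL]   # drop the block index (1 = A block, 2 = B block)

# second melt: stack V1/V2 into key/value
m2 <- melt(m1, id.vars = c("replicate", "group"),
           variable.name = "key", value.name = "value")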

2.a

With tidyverse, we can subset the dataset into a list, then loop through the list with map_df, converting each element to 'long' format with gather, to get a single data.frame:

library(tidyverse)
list(df1[1:4], df1[c(1,5:7)]) %>%
      map_df(~gather(., key, value, 3:4) %>%
                   {names(.)[2] <- 'group';.}) %>%
      mutate(key = sub('(.).(.)', '\\1\\2', key))
#   replicate group key       value
#1          1     a  V1  0.34291660
#2          2     b  V1 -1.31837320
#3          3     c  V1 -0.79083580
#4          4     d  V1 -0.99636770
#5          5     e  V1 -1.36346090
#6          1     a  V2 -2.30336406
#7          2     b  V2 -0.13516849
#8          3     c  V2 -0.10746447
#9          4     d  V2 -1.82557058
#10         5     e  V2  0.04385812
#11         1     f  V1  0.05363582
#12         2     g  V1 -0.42586417
#13         3     h  V1  1.05134242
#14         4     i  V1 -1.14532536
#15         5     j  V1 -0.65643595
#16         1     f  V2  1.64540780
#17         2     g  V2  0.15415410
#18         3     h  V2  1.42973500
#19         4     i  V2  1.08157330
#20         5     j  V2 -0.14528770

2.b

If we need to split based on the occurrence of 'group':

split.default(df1[-1], cumsum(grepl('group', names(df1)[-1]))) %>% 
         map(~bind_cols(df1[1], .)) %>% 
         map_df(~gather(., key, value, 3:4) %>% 
               {names(.)[2] <- 'group';.}) %>%
         mutate(key = sub('(.).(.)', '\\1\\2', key))

2.c

This uses rename_at instead of direct names assignment, to stay within tidyverse idiom:

df1[-1] %>% 
      split.default(cumsum(grepl('group', names(df1)[-1]))) %>% 
      map_df(~bind_cols(df1[1], .) %>% 
           gather(., key, value, 3:4) %>%
           rename_at(2, funs(substring(.,1, 5))))

NOTE:

1) Options 2.a, 2.b, and 2.c all use tidyverse functions

2) Nothing depends on the substring 'A' or 'B' in the column names (see the sketch after this note)

3) It is assumed that the pattern in the OP's dataset is a 'group' column followed by its value columns
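
To illustrate notes 2) and 3): because the split keys only on the string 'group', a third block of columns works with the 2.b code unchanged. A quick sketch, where the groupC/VC1/VC2 columns are made up purely for illustration and df1 is again the question's data:

# add a hypothetical third block of columns, just for illustration
df3 <- data.frame(df1, groupC = letters[11:15], VC1 = rnorm(5), VC2 = rnorm(5))

split.default(df3[-1], cumsum(grepl('group', names(df3)[-1]))) %>% 
         map(~bind_cols(df3[1], .)) %>% 
         map_df(~gather(., key, value, 3:4) %>% 
               {names(.)[2] <- 'group';.}) %>%
         mutate(key = sub('(.).(.)', '\\1\\2', key))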

akrun
  • Thanks for your answer. I'm looking specifically for a tidyverse approach, though. – Claus Wilke Nov 25 '17 at 15:36
  • Should be able to do the same with a double `gather`. – mikeck Nov 25 '17 at 15:40
  • @mikeck I'm not sure how. If you can write it out I'd be very interested. – Claus Wilke Nov 25 '17 at 15:49
  • @akrun Yes, your solution solves the issue of how to gather the two blocks of data separately. I'd prefer a solution that doesn't so heavily rely on hard-coded column indices, but that may not be possible with current tidyverse functionality. In particular, I think tidyverse would need a function that can replace the first `list()` statement. – Claus Wilke Nov 25 '17 at 16:06
  • @ClausWilke If the 'group' and value columns are in the order specified in the data, then you can do a `split`, i.e. `split.default(df1[-1], cumsum(grepl('group', names(df1)[-1]))) %>% map(~bind_cols(df1[1], .)) %>% map_df(~gather(., key, value, 3:4) %>% {names(.)[2] <- 'group';.}) %>% mutate(key = sub('(.).(.)', '\\1\\2', key))` – akrun Nov 25 '17 at 16:10
  • It's a close call. There's nothing wrong with your answer, but I prefer @g-grothendieck's Solution 1 because it doesn't use column indices. It's also somewhat easier to read and understand. If I could give out two check marks I would. – Claus Wilke Nov 26 '17 at 05:23
  • @ClausWilke It's okay. It's only because you said earlier that `I'm looking specifically for a tidyverse approach`. So I used `map_df` etc. from the `purrr` package, which belongs to `tidyverse` – akrun Nov 26 '17 at 05:26
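
The hard-coded column indices and the manual list() step discussed in the comments above are the kind of thing the pivot_longer interface added in later tidyr releases (1.0.0 and up) can express directly through its ".value" mechanism. A minimal sketch, using the question's df_wide and an intermediate rename (my own device) that puts a '_' between the block letter and the rest of each name:

library(dplyr)   # rename_with requires dplyr >= 1.0.0
library(tidyr)   # pivot_longer requires tidyr >= 1.0.0

df_wide %>%
  # temporary names: groupA -> A_group, VA1 -> A_V1, VB2 -> B_V2, ...
  rename_with(~ sub("^(group|V)([AB])([0-9]*)$", "\\2_\\1\\3", .x), -replicate) %>%
  # one row per replicate and block; ".value" spreads group/V1/V2 back out as columns
  pivot_longer(-replicate, names_to = c("set", ".value"), names_sep = "_") %>%
  # stack V1/V2 into the key/value pair from the desired output
  pivot_longer(starts_with("V"), names_to = "key", values_to = "value") %>%
  select(replicate, group, key, value)

The rows come out interleaved by replicate rather than in the blocked order shown in the question, but the content is the same.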

1) This solution consists of:

  • a gather, which generates the desired number of rows,
  • a mutate, which combines the groupA and groupB columns into a single group column and changes the key column to the requested values, and
  • a select, which picks out the columns wanted.

First gather the columns whose names start with V, and then create a new group column from groupA and groupB, choosing groupA if the key has an A in it and groupB if the key has a B in it. (We use mapply(switch, ...) here for easy extension to the 3+ group case; a three-group sketch follows the output below. Given only two groups, we could instead have used an ifelse, viz. ifelse(grepl("A", key), as.character(groupA), as.character(groupB)).) The mutate also reduces the key names from VA1 to V1, etc., and finally the select picks out the desired columns.

DF %>% 
   gather(key, value, starts_with("V")) %>%
   mutate(group = mapply(switch, gsub("[^AB]", "", key), A = groupA, B = groupB),
          key = sub("[AB]", "", key)) %>%
   select(replicate, group, key, value)

giving:

   replicate group key       value
1          1     a  V1  0.34291660
2          2     b  V1 -1.31837320
3          3     c  V1 -0.79083580
4          4     d  V1 -0.99636770
5          5     e  V1 -1.36346090
6          1     a  V2 -2.30336406
7          2     b  V2 -0.13516849
8          3     c  V2 -0.10746447
9          4     d  V2 -1.82557058
10         5     e  V2  0.04385812
11         1     f  V1  0.05363582
12         2     g  V1 -0.42586417
13         3     h  V1  1.05134242
14         4     i  V1 -1.14532536
15         5     j  V1 -0.65643595
16         1     f  V2  1.64540780
17         2     g  V2  0.15415410
18         3     h  V2  1.42973500
19         4     i  V2  1.08157330
20         5     j  V2 -0.14528770
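
As mentioned above, extending (1) to a third group only requires one more named argument in the mapply(switch, ...) call and a wider character class in the regular expressions. A sketch with a made-up third block of columns (groupC, VC1, VC2) added to DF:

# DF3 adds a hypothetical third block purely for illustration
DF3 <- data.frame(DF, groupC = factor(letters[11:15]), VC1 = rnorm(5), VC2 = rnorm(5))

DF3 %>% 
   gather(key, value, starts_with("V")) %>%
   mutate(group = mapply(switch, gsub("[^ABC]", "", key),
                         A = groupA, B = groupB, C = groupC),
          key = sub("[ABC]", "", key)) %>%
   select(replicate, group, key, value)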

2) Another approach would be to split the columns into groups such that all columns in a group have the same name after removing A and B from their names. Perform unlist on each such group to reduce the list to a list of plain vectors, and convert that list to a data.frame. Finally gather the V columns and rearrange. Note that rownames_to_column is from the tibble package.

DF %>%
   as.list %>%
   split(sub("[AB]", "", names(.))) %>%
   lapply(unlist) %>%
   as.data.frame %>%
   rownames_to_column %>%
   gather(key, value, starts_with("V")) %>%
   arrange(gsub("[^AB]", "", rowname), key) %>%
   select(replicate, group, key, value)

2a) If the row order is not important, then the rownames_to_column, arrange, and select lines can be omitted, shortening it to this:

DF %>%
   as.list %>%
   split(sub("[AB]", "", names(.))) %>%
   lapply(unlist) %>%
   as.data.frame %>%
   gather(key, value, starts_with("V"))

Solutions (2) and (2a) could easily be converted to base-only solutions by replacing the gather with the appropriate reshape from base, as in the second reshape (the one producing d2) in (3).
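
A sketch of that conversion for (2a), with the split/unlist step rewritten without the pipe; the times argument is my addition, used only so that key reads V1/V2 rather than 1/2:

# same split/unlist/as.data.frame step as in (2a), in base R
d <- as.data.frame(lapply(split(as.list(DF), sub("[AB]", "", names(DF))), unlist))

# reshape takes the place of gather
vcols <- grep("V", names(d), value = TRUE)
d2 <- reshape(d, dir = "long", varying = list(vcols), v.names = "value",
              timevar = "key", times = vcols)
d2 <- subset(d2, select = c(replicate, group, key, value))
d2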

3) Although the question asked for a tidyverse solution, there is a fairly convenient base solution consisting of two reshape calls. The varying list produced by the split is: list(group = c("groupA", "groupB"), V1 = c("VA1", "VB1"), V2 = c("VA2", "VB2")) -- that is, it matches up the i-th column in each set of columns.

varying <- split(names(DF)[-1], gsub("[AB]", "", names(DF))[-1])
d <- reshape(DF, dir = "long", varying = varying, v.names = names(varying))
d <- subset(d, select = -c(time, id))

d2 <- reshape(d, dir = "long", varying = list(grep("V", names(d))), v.names = "value", 
  timevar = "key")
d2 <- subset(d2, select = c(replicate, group, key, value))

d2

Note: The input in reproducible form is:

DF <- structure(list(replicate = 1:5, groupA = structure(1:5, .Label = c("a", 
"b", "c", "d", "e"), class = "factor"), VA1 = c(0.3429166, -1.3183732, 
-0.7908358, -0.9963677, -1.3634609), VA2 = c(-2.30336406, -0.13516849, 
-0.10746447, -1.82557058, 0.04385812), groupB = structure(1:5, .Label = c("f", 
"g", "h", "i", "j"), class = "factor"), VB1 = c(0.05363582, -0.42586417, 
1.05134242, -1.14532536, -0.65643595), VB2 = c(1.6454078, 0.1541541, 
1.429735, 1.0815733, -0.1452877)), .Names = c("replicate", "groupA", 
"VA1", "VA2", "groupB", "VB1", "VB2"), class = "data.frame", row.names = c("1", 
"2", "3", "4", "5"))
G. Grothendieck