112

I have data from an online survey where respondents go through a loop of questions 1-3 times. The survey software (Qualtrics) records this data in multiple columns—that is, Q3.2 in the survey will have columns Q3.2.1., Q3.2.2., and Q3.2.3.:

df <- data.frame(
  id = 1:10,
  time = as.Date('2009-01-01') + 0:9,
  Q3.2.1. = rnorm(10, 0, 1),
  Q3.2.2. = rnorm(10, 0, 1),
  Q3.2.3. = rnorm(10, 0, 1),
  Q3.3.1. = rnorm(10, 0, 1),
  Q3.3.2. = rnorm(10, 0, 1),
  Q3.3.3. = rnorm(10, 0, 1)
)

# Sample data

   id       time    Q3.2.1.     Q3.2.2.    Q3.2.3.     Q3.3.1.    Q3.3.2.     Q3.3.3.
1   1 2009-01-01 -0.2059165 -0.29177677 -0.7107192  1.52718069 -0.4484351 -1.21550600
2   2 2009-01-02 -0.1981136 -1.19813815  1.1750200 -0.40380049 -1.8376094  1.03588482
3   3 2009-01-03  0.3514795 -0.27425539  1.1171712 -1.02641801 -2.0646661 -0.35353058
...

I want to combine all the QN.N* columns into tidy individual QN.N columns, ultimately ending up with something like this:

   id       time loop_number        Q3.2        Q3.3
1   1 2009-01-01           1 -0.20591649  1.52718069
2   2 2009-01-02           1 -0.19811357 -0.40380049
3   3 2009-01-03           1  0.35147949 -1.02641801
...
11  1 2009-01-01           2 -0.29177677  -0.4484351
12  2 2009-01-02           2 -1.19813815  -1.8376094
13  3 2009-01-03           2 -0.27425539  -2.0646661
...
21  1 2009-01-01           3 -0.71071921 -1.21550600
22  2 2009-01-02           3  1.17501999  1.03588482
23  3 2009-01-03           3  1.11717121 -0.35353058
...

The tidyr library has the gather() function, which works great for combining one set of columns:

library(dplyr)
library(tidyr)
library(stringr)

df %>% gather(loop_number, Q3.2, starts_with("Q3.2")) %>% 
  mutate(loop_number = str_sub(loop_number,-2,-2)) %>%
  select(id, time, loop_number, Q3.2)


   id       time loop_number        Q3.2
1   1 2009-01-01           1 -0.20591649
2   2 2009-01-02           1 -0.19811357
3   3 2009-01-03           1  0.35147949
...
29  9 2009-01-09           3 -0.58581232
30 10 2009-01-10           3 -2.33393981

The resultant data frame has 30 rows, as expected (10 individuals, 3 loops each). However, gathering a second set of columns does not work correctly—it successfully makes the two combined columns Q3.2 and Q3.3, but ends up with 90 rows instead of 30 (all combinations of 10 individuals, 3 loops of Q3.2, and 3 loops of Q3.3; the combinations will increase substantially for each group of columns in the actual data):

df %>% gather(loop_number, Q3.2, starts_with("Q3.2")) %>% 
  gather(loop_number, Q3.3, starts_with("Q3.3")) %>%
  mutate(loop_number = str_sub(loop_number,-2,-2))


   id       time loop_number        Q3.2        Q3.3
1   1 2009-01-01           1 -0.20591649  1.52718069
2   2 2009-01-02           1 -0.19811357 -0.40380049
3   3 2009-01-03           1  0.35147949 -1.02641801
...
89  9 2009-01-09           3 -0.58581232 -0.13187024
90 10 2009-01-10           3 -2.33393981 -0.48502131

Is there a way to use multiple calls to gather() like this, combining small subsets of columns like this while maintaining the correct number of rows?

Jaap
  • 71,900
  • 30
  • 164
  • 175
Andrew
  • 30,151
  • 11
  • 57
  • 89
  • what's wrong with `df %>% gather(loop_number, Q3.2, starts_with("Q3."))` – Alex Sep 19 '14 at 03:58
  • That gets me one consolidated column with 60 rows. I guess that could work if I then included some sort of call to `seperate()` to divide up the Q3.3 (and beyond) values into their own columns. But that still seems like a really roundabout hacky solution… – Andrew Sep 19 '14 at 04:05
  • use `spread` i am working on a solution now :p – Alex Sep 19 '14 at 04:07
  • try this! `df %>% gather(question_number, Q3.2, starts_with("Q3.")) %>% mutate(loop_number = str_sub(question_number,-2,-2), question_number = str_sub(question_number,1,4)) %>% select(id, time, loop_number, question_number, Q3.2) %>% spread(key = question_number, value = Q3.2)` – Alex Sep 19 '14 at 04:09
  • Ooh, that works really well for the two variables. I'm curious if it's scalable though—in my real data I've got Q3.2-Q3.30, so it would need a bunch of individual calls to `spread()`. Though multiple calls appears inevitable anyway, whether it's a bunch of `generate()`s that work or nested `spread()`s… – Andrew Sep 19 '14 at 04:26
  • @akrun figured out a scalable version below, with `gather` + `separate` + `spread`. Magic stuff. – Andrew Sep 19 '14 at 04:32
  • my code should work for any number of questions, since the question number column will have distinct `Q3.2, Q3.3, ..., Q3.30`. – Alex Sep 19 '14 at 04:41
  • Oh, I see. I misread it and thought that in `spread(..., value=X)`, I would need to set `X` to each of the final column names. But that's definitely not the case. – Andrew Sep 19 '14 at 04:55
  • +1 great question. FWIW, it's neither natural nor that efficient to have to melt (gather) the whole data set only to cast (spread) back. I'll post an answer if I manage to succeed implementing melt to accomplish this directly as required in this scenario. – Arun Sep 19 '14 at 12:44

5 Answers5

151

This approach seems pretty natural to me:

df %>%
  gather(key, value, -id, -time) %>%
  extract(key, c("question", "loop_number"), "(Q.\\..)\\.(.)") %>%
  spread(question, value)

First gather all question columns, use extract() to separate into question and loop_number, then spread() question back into the columns.

#>    id       time loop_number         Q3.2        Q3.3
#> 1   1 2009-01-01           1  0.142259203 -0.35842736
#> 2   1 2009-01-01           2  0.061034802  0.79354061
#> 3   1 2009-01-01           3 -0.525686204 -0.67456611
#> 4   2 2009-01-02           1 -1.044461185 -1.19662936
#> 5   2 2009-01-02           2  0.393808163  0.42384717
hadley
  • 94,313
  • 27
  • 170
  • 239
  • 5
    Hello. I have many columns with names ending in 1 and 2, like age1, age2, weight1, weight2, blood1, blood2.... How would I apply your method here? – skan May 03 '16 at 12:19
  • 4
    What does this part mean: "(Q.\\..)\\.(.)" What would I search for to decode what is happening there? – mob Nov 18 '17 at 12:05
  • 3
    @mob Regular expressions – hadley Nov 21 '17 at 13:17
  • 3
    @mob "(Q.\\..)\\.(.)" is a regular expression with parentheses that define the groups of the regular expression to extract into "question" and "loop_number". More specifically, in this example, the items in key with the expression "Q.\\.." goes into the "question" column (i.e., "Q3.2" and "Q3.3"), then the part after next period, expressed as ".", goes into the "loop_number" column. – LC-datascientist Jan 24 '20 at 23:18
33

This could be done using reshape. It is possible with dplyr though.

  colnames(df) <- gsub("\\.(.{2})$", "_\\1", colnames(df))
  colnames(df)[2] <- "Date"
  res <- reshape(df, idvar=c("id", "Date"), varying=3:8, direction="long", sep="_")
  row.names(res) <- 1:nrow(res)
  
   head(res)
  #  id       Date time       Q3.2       Q3.3
  #1  1 2009-01-01    1  1.3709584  0.4554501
  #2  2 2009-01-02    1 -0.5646982  0.7048373
  #3  3 2009-01-03    1  0.3631284  1.0351035
  #4  4 2009-01-04    1  0.6328626 -0.6089264
  #5  5 2009-01-05    1  0.4042683  0.5049551
  #6  6 2009-01-06    1 -0.1061245 -1.7170087

Or using dplyr

  library(tidyr)
  library(dplyr)
  colnames(df) <- gsub("\\.(.{2})$", "_\\1", colnames(df))

  df %>%
     gather(loop_number, "Q3", starts_with("Q3")) %>% 
     separate(loop_number,c("L1", "L2"), sep="_") %>% 
     spread(L1, Q3) %>%
     select(-L2) %>%
     head()
  #  id       time       Q3.2       Q3.3
  #1  1 2009-01-01  1.3709584  0.4554501
  #2  1 2009-01-01  1.3048697  0.2059986
  #3  1 2009-01-01 -0.3066386  0.3219253
  #4  2 2009-01-02 -0.5646982  0.7048373
  #5  2 2009-01-02  2.2866454 -0.3610573
  #6  2 2009-01-02 -1.7813084 -0.7838389

Update

With new version of tidyr, we can use pivot_longer to reshape multiple columns. (Using the changed column names from gsub above)

library(dplyr)
library(tidyr)
df %>% 
    pivot_longer(cols = starts_with("Q3"), 
          names_to = c(".value", "Q3"), names_sep = "_") %>% 
    select(-Q3)
# A tibble: 30 x 4
#      id time         Q3.2    Q3.3
#   <int> <date>      <dbl>   <dbl>
# 1     1 2009-01-01  0.974  1.47  
# 2     1 2009-01-01 -0.849 -0.513 
# 3     1 2009-01-01  0.894  0.0442
# 4     2 2009-01-02  2.04  -0.553 
# 5     2 2009-01-02  0.694  0.0972
# 6     2 2009-01-02 -1.11   1.85  
# 7     3 2009-01-03  0.413  0.733 
# 8     3 2009-01-03 -0.896 -0.271 
#9     3 2009-01-03  0.509 -0.0512
#10     4 2009-01-04  1.81   0.668 
# … with 20 more rows

NOTE: Values are different because there was no set seed in creating the input dataset

akrun
  • 674,427
  • 24
  • 381
  • 486
  • Whoa, this works perfectly. tidyr is ostensibly a replacement/upgrade for reshape--I wonder if @hadley knows of a way to do this same thing with dplyr or tidyr… – Andrew Sep 19 '14 at 04:29
  • That is pure magic. The only thing I added was `mutate(loop_number = as.numeric(L2))` before dropping `L2`, and it's perfect. – Andrew Sep 19 '14 at 04:35
  • 1
    @Andrew I personally prefer the `reshape` method for its compact code, though `dplyr` may be faster for big datasets. – akrun Sep 19 '14 at 04:39
  • 2
    I've never been able to understand the `reshape()` function, see my solution for what I seems to me a pretty clean tidyr implementation. – hadley Sep 19 '14 at 10:46
  • 1
    `tidyr::pivot_longer`: Now not only in the development version :) – tjebo Feb 18 '21 at 14:41
  • 1
    @tjebo thank you. I updated that part – akrun Feb 18 '21 at 16:51
22

With the recent update to melt.data.table, we can now melt multiple columns. With that, we can do:

require(data.table) ## 1.9.5
melt(setDT(df), id=1:2, measure=patterns("^Q3.2", "^Q3.3"), 
     value.name=c("Q3.2", "Q3.3"), variable.name="loop_number")
 #    id       time loop_number         Q3.2        Q3.3
 # 1:  1 2009-01-01           1 -0.433978480  0.41227209
 # 2:  2 2009-01-02           1 -0.567995351  0.30701144
 # 3:  3 2009-01-03           1 -0.092041353 -0.96024077
 # 4:  4 2009-01-04           1  1.137433487  0.60603396
 # 5:  5 2009-01-05           1 -1.071498263 -0.01655584
 # 6:  6 2009-01-06           1 -0.048376809  0.55889996
 # 7:  7 2009-01-07           1 -0.007312176  0.69872938

You can get the development version from here.

Arun
  • 108,644
  • 21
  • 263
  • 366
  • Hello. I have many columns with names ending in 1 and 2, like age1, age2, weight1, weight2, blood1, blood2.... How would I apply your method here? – skan May 03 '16 at 12:19
  • skan, check the [reshaping vignette](https://github.com/Rdatatable/data.table/wiki/Getting-started). Good luck! – Arun May 03 '16 at 14:09
  • I did but I don't know how to properly embed regular expressions to split column names and pass it to melt. There is only one example with patterns, and it's too simple. In my case I would need to include many many column names inside pattern() – skan May 03 '16 at 17:21
  • Imagine you have these columns: paste0(rep(LETTERS, each=3), 1:3) and you want to get the long table defined by a letter and a number – skan May 03 '16 at 17:44
  • This is hands down the most succinct and easy to interpret. – Michael Bellhouse Dec 27 '19 at 18:04
10

It's not at all related to "tidyr" and "dplyr", but here's another option to consider: merged.stack from my "splitstackshape" package, V1.4.0 and above.

library(splitstackshape)
merged.stack(df, id.vars = c("id", "time"), 
             var.stubs = c("Q3.2.", "Q3.3."),
             sep = "var.stubs")
#     id       time .time_1       Q3.2.       Q3.3.
#  1:  1 2009-01-01      1. -0.62645381  1.35867955
#  2:  1 2009-01-01      2.  1.51178117 -0.16452360
#  3:  1 2009-01-01      3.  0.91897737  0.39810588
#  4:  2 2009-01-02      1.  0.18364332 -0.10278773
#  5:  2 2009-01-02      2.  0.38984324 -0.25336168
#  6:  2 2009-01-02      3.  0.78213630 -0.61202639
#  7:  3 2009-01-03      1. -0.83562861  0.38767161
# <<:::SNIP:::>>
# 24:  8 2009-01-08      3. -1.47075238 -1.04413463
# 25:  9 2009-01-09      1.  0.57578135  1.10002537
# 26:  9 2009-01-09      2.  0.82122120 -0.11234621
# 27:  9 2009-01-09      3. -0.47815006  0.56971963
# 28: 10 2009-01-10      1. -0.30538839  0.76317575
# 29: 10 2009-01-10      2.  0.59390132  0.88110773
# 30: 10 2009-01-10      3.  0.41794156 -0.13505460
#     id       time .time_1       Q3.2.       Q3.3.
A5C1D2H2I1M1N2O1R2T1
  • 177,446
  • 27
  • 370
  • 450
  • 1
    Hello. I have many columns with names ending in 1 and 2, like age1, age2, weight1, weight2, blood1, blood2.... How would I apply your method here? – skan May 03 '16 at 12:19
6

In case you are like me, and cannot work out how to use "regular expression with capturing groups" for extract, the following code replicates the extract(...) line in Hadleys' answer:

df %>% 
    gather(question_number, value, starts_with("Q3.")) %>%
    mutate(loop_number = str_sub(question_number,-2,-2), question_number = str_sub(question_number,1,4)) %>%
    select(id, time, loop_number, question_number, value) %>% 
    spread(key = question_number, value = value)

The problem here is that the initial gather forms a key column that is actually a combination of two keys. I chose to use mutate in my original solution in the comments to split this column into two columns with equivalent info, a loop_number column and a question_number column. spread can then be used to transform the long form data, which are key value pairs (question_number, value) to wide form data.

Alex
  • 13,293
  • 9
  • 60
  • 113