Tidying dataset by gathering multiple columns?

Question

I want to tidy a dataset by manipulating the data this way:

age gender  education       previous_comp_exp   tutorial_time   qID.1    time_taken.1   qID.2    time_taken.2   
18  Male    Undergraduate   casual gamer        62.17926        sor9     39.61206       sor8     19.4892
24  Male    Undergraduate   casual gamer        85.01288        sor9     50.92343       sor8     16.15616

To become this:

age gender  education       previous_comp_exp   tutorial_time   qID      time_taken 
18  Male    Undergraduate   casual gamer        62.17926        sor9     39.61206       
18  Male    Undergraduate   casual gamer        62.17926        sor8     19.4892
24  Male    Undergraduate   casual gamer        85.01288        sor9     50.92343       
24  Male    Undergraduate   casual gamer        85.01288        sor8     16.15616

I've experimented with gather(), but I can only only get it working with one column and I keep getting this warning:

Warning message:attributes are not identical across measure variables; they will be dropped

Any ideas?

It's not an error, it's a warning to let you know that two of the columns you've stacked have different attributes (maybe they're both factors, but with different levels) and those attributes were therefore dropped in the output. [This](https://stackoverflow.com/q/41856564/496488) and [this](https://stackoverflow.com/questions/12466493/reshaping-multiple-sets-of-measurement-columns-wide-format-into-single-columns) might be helpful in taking care of the reshaping from wide to long, given that you have pairs of columns that each need to be stacked. — eipi10, Nov 13 '17 at 19:59

acylam · Accepted Answer · 2017-11-19T04:25:11.340

With melt from data.table (see ?patterns):

library(data.table)

melt(setDT(df), measure = patterns("^qID", "^time_taken"),
     value.name = c("qID", "time_taken"))

Result:

   age gender     education previous_comp_exp tutorial_time variable  qID time_taken
1:  18   Male Undergraduate      casual_gamer      62.17926        1 sor9   39.61206
2:  24   Male Undergraduate      casual_gamer      85.01288        1 sor9   50.92343
3:  18   Male Undergraduate      casual_gamer      62.17926        2 sor8   19.48920
4:  24   Male Undergraduate      casual_gamer      85.01288        2 sor8   16.15616

or with tidyr:

library(dplyr)
library(tidyr)

df %>%
  gather(variable, value, qID.1:time_taken.2) %>%
  mutate(variable = sub("\\.\\d$", "", variable)) %>%
  group_by(variable) %>%
  mutate(ID = row_number()) %>%
  spread(variable, value, convert = TRUE) %>%
  select(-ID)

Result:

# A tibble: 4 x 7
    age gender     education previous_comp_exp tutorial_time   qID time_taken
  <int> <fctr>        <fctr>            <fctr>         <dbl> <chr>      <dbl>
1    18   Male Undergraduate      casual_gamer      62.17926  sor9   39.61206
2    18   Male Undergraduate      casual_gamer      62.17926  sor8   19.48920
3    24   Male Undergraduate      casual_gamer      85.01288  sor9   50.92343
4    24   Male Undergraduate      casual_gamer      85.01288  sor8   16.15616

Note:

For the tidyr method, convert=TRUE is used to convert time_taken back to numeric, since it was coerced to character when gathered with the qID columns.

Data:

df = structure(list(age = c(18L, 24L), gender = structure(c(1L, 1L
), .Label = "Male", class = "factor"), education = structure(c(1L, 
1L), .Label = "Undergraduate", class = "factor"), previous_comp_exp = structure(c(1L, 
1L), .Label = "casual_gamer", class = "factor"), tutorial_time = c(62.17926, 
85.01288), qID.1 = structure(c(1L, 1L), .Label = "sor9", class = "factor"), 
    time_taken.1 = c(39.61206, 50.92343), qID.2 = structure(c(1L, 
    1L), .Label = "sor8", class = "factor"), time_taken.2 = c(19.4892, 
    16.15616)), .Names = c("age", "gender", "education", "previous_comp_exp", 
"tutorial_time", "qID.1", "time_taken.1", "qID.2", "time_taken.2"
), class = "data.frame", row.names = c(NA, -2L))

@useR, in the Tidy approach, the age variable is gone. any ideas why? — stenfeio, Nov 13 '17 at 20:33
@stenfeio Thanks for catching that! I actually read in the data incorrectly. Since I was using `read.table`, `casual` and `gamer` were treated as separate columns and the first column as rownames. It didn't throw an error because the number of columns happened to match up. See my edits — acylam, Nov 13 '17 at 20:39
I didn't know the pattern function, nice line. @useR I have a question I tried before seeing your solution: time — denis, Nov 13 '17 at 20:49
@denis Just a silly mistake. You misspelled "measure". So `melt` uses the default of _"If missing, all non-id columns will be assigned to it."_ (see `?melt`) — acylam, Nov 13 '17 at 20:56
@useR Wow sorry I think I am tired. Thank you so much for the help and your nice answer. Indeed it used the default. — denis, Nov 13 '17 at 21:00
You can use `convert = TRUE` in `spread` rather than manually setting types in `mutate` later. — aosmith, Nov 14 '17 at 16:10

agstudy · Answer 2 · 2017-11-13T20:43:05.710

6

In base R you can use the powerful reshape to transform your data from wide to long format in one-line statement:

   reshape(dx,direction="long",
        varying=list(grep("qID",colnames(dx)),
                     grep("time_taken",colnames(dx))),
        v.names=c("qID","time_taken"))

     age gender     education previous_comp_exp tutorial_time time  qID time_taken id
1.1  18   Male Undergraduate      casual_gamer      62.17926    1 sor9   39.61206  1
2.1  24   Male Undergraduate      casual_gamer      85.01288    1 sor9   50.92343  2
1.2  18   Male Undergraduate      casual_gamer      62.17926    2 sor8   19.48920  1
2.2  24   Male Undergraduate      casual_gamer      85.01288    2 sor8   16.15616  2

edited Nov 13 '17 at 20:43

answered Nov 13 '17 at 20:07

agstudy

113,354
16
180
244

I think you also read in the data incorrectly. You can use the new `dput` in my answer. – acylam Nov 13 '17 at 20:40
@useR good catch. fixed now. – agstudy Nov 13 '17 at 20:43

Tidying dataset by gathering multiple columns?

2 Answers2