Combine Multiple Columns Into Tidy Data

Question

My dataset looks like this:

unique.id abx.1    start.1     stop.1 abx.2    start.2     stop.2 abx.3    start.3     stop.3 abx.4    start.4
1         1  Moxi 2014-01-01 2014-01-07  PenG 2014-01-01 2014-01-07 Vanco 2014-01-01 2014-01-07  Moxi 2014-01-01
2         2  Moxi 2014-01-01 2014-01-02 Cipro 2014-01-01 2014-01-02  PenG 2014-01-01 2014-01-02 Vanco 2014-01-01
3         3 Cipro 2014-01-01 2014-01-05 Vanco 2014-01-01 2014-01-05 Cipro 2014-01-01 2014-01-05 Vanco 2014-01-01
4         4 Vanco 2014-01-02 2014-01-03 Cipro 2014-01-02 2014-01-03 Cipro 2014-01-02 2014-01-03  PenG 2014-01-02
5         5 Vanco 2014-01-01 2014-01-02  PenG 2014-01-01 2014-01-02  PenG 2014-01-01 2014-01-02 Cipro 2014-01-01
      stop.4    intervention
1 2014-01-07       0
2 2014-01-02       0
3 2014-01-05       1
4 2014-01-03       1
5 2014-01-02       0

With some code to create this:

 abxoptions <- c("Cipro", "Moxi", "PenG", "Vanco")
      df3 <- data.frame(
      unique.id = 1:5,
      abx.1 = sample(abxoptions,5, replace=TRUE),
      start.1 = as.Date(c('2014-01-01', '2014-01-01', '2014-01-01', '2014-01-02', '2014-01-01')),
      stop.1  = as.Date(c('2014-01-07', '2014-01-02', '2014-01-05', '2014-01-03', '2014-01-02')),
      abx.2 = sample(abxoptions,5, replace=TRUE),         
      start.2 = as.Date(c('2014-01-01', '2014-01-01', '2014-01-01', '2014-01-02', '2014-01-01')),
      stop.2  = as.Date(c('2014-01-07', '2014-01-02', '2014-01-05', '2014-01-03', '2014-01-02')),
      abx.3 = sample(abxoptions,5, replace=TRUE),         
      start.3 = as.Date(c('2014-01-01', '2014-01-01', '2014-01-01', '2014-01-02', '2014-01-01')),
      stop.3  = as.Date(c('2014-01-07', '2014-01-02', '2014-01-05', '2014-01-03', '2014-01-02')),
      abx.4 = sample(abxoptions,5, replace=TRUE),         
      start.4 = as.Date(c('2014-01-01', '2014-01-01', '2014-01-01', '2014-01-02', '2014-01-01')),
      stop.4  = as.Date(c('2014-01-07', '2014-01-02', '2014-01-05', '2014-01-03', '2014-01-02')),
      intervention = c(0,0,1,1,0)

)

I would like to tidy this data to look like this:

unique.id    abx     start    stop           intervention
1            Moxi    2014-01-10 2014-01-07      0
1            Pen G   2014-01-01 2014-01-07      0
1            Vanco   2014-01-01 2014-01-07      0
1            Moxi    2014-01-01 2014-01-07      0  etc etc

The following solutions didn't get me where I needed: Gather multiple sets of columns and Combining multiple columns into one

I suspect that Hadley's amazing tidyr pakcage is the way to go...just can't figure this out. Any help would be greatly appreciated.

score 10 · Accepted Answer · edited Feb 26 '15 at 22:31

10

Almost every data tidying problem can be solved in three steps:

Gather all non-variable columns
Separate "colname" column into multiple variables
Re-spread the data

(often you'll only need one or two of these, but I think they're almost always in this order).

For your data:

The only column that's already a variable is unique.id
You need to split current column names into variable and number
Then you need to put the "variable" variable back into columns

This looks like:

library(tidyr)
library(dplyr)

df3 %>%
  gather(col, value, -unique.id, -intervention) %>%
  separate(col, c("variable", "number")) %>%
  spread(variable, value, convert = TRUE) %>%
  mutate(start = as.Date(start, "1970-01-01"), stop = as.Date(stop, "1970-01-01"))

Your case is a bit more complicated because you have two types of variables, so you need to restore the types at the end.

edited Feb 26 '15 at 22:31

Tom O

1,157
2
11
14

answered Feb 25 '15 at 22:45

hadley

94,313
27
170
239

Hadley. really appreciate the response. This solved it. I had to remove another column in my actual dataset which marked if this occured in the intervention phase or not (1 or 0). With this column still included I get : Error: Values not split into 2 pieces at.675,676.....then attributes are not identical across measure variables; they will be dropped . I need to read up further on your documentation as I'm still not totally clear on what's occuring here. – Tom O Feb 26 '15 at 14:39
1

I'd recommend running a line-at-a-time and looking at the results for each step. For the split problem, you'll need to look at the extra args to separate. – hadley Feb 26 '15 at 19:32
Just what I needed! Problem solved! – Tom O Feb 26 '15 at 20:33
Much love to H Wicks – tylcole Feb 14 '19 at 00:38

score 7 · Answer 2 · answered Feb 25 '15 at 21:14

7

You could try reshape from base R

reshape(df3, direction='long', varying=2:ncol(df3), sep=".")

Or use merged.stack from splitstackshape

 library(splitstackshape)
 merged.stack(df3, var.stubs=c('abx', 'start', 'stop'), sep='.')[,
    c('start', 'stop') := lapply(.SD, as.Date,
                   origin='1970-01-01'), .SDcols=4:5][]

answered Feb 25 '15 at 21:14

akrun

674,427
24
381
486

Hmm, I'm getting an error when I attempt the reshape method. I should note the data in my dataframe extends to abx.14 :Error in guess(varying) : failed to guess time-varying variables from their names In addition: Warning message: In rbind(c("start", "date"), c("stop", "date"), c("abx", "2"), c("start", : number of columns of result is not a multiple of vector length (arg 1) – Tom O Feb 25 '15 at 21:27
@TomO Based on the example you created, I didn't get any error – akrun Feb 25 '15 at 21:28
You're correct in my sample dataset it works. – Tom O Feb 25 '15 at 21:29
@TomO Not sure how you got the error. I am using `R 3.1.2`. Can you change the `.` to `_` for the column names and try again? ie. `names(df3)[-1] – akrun Feb 25 '15 at 21:31
I suspect there's further errors in the colnames in my original data set. I'll clean and report back. Really appreciate the help akrun. I've read many of your responses previously – Tom O Feb 25 '15 at 21:33
1

@TomO Glad to help you. It's a bit late here. So, I will check your update later. – akrun Feb 25 '15 at 21:34
akrun - As with the tidyr solution removing this extra column intervention (1/0) solved my error.] – Tom O Feb 26 '15 at 14:46

Arun · Answer 3 · 2015-07-28T21:47:40.623

Recently, a new feature has been added to melt.data.table, which allows melting into multiple columns painless. All you've to do is provide the columns you'd want to melt separately in a list in measure.vars argument.

You can grab the development version by following these instructions.

require(data.table) ## v1.9.5
setDT(dat) # dat is now a data.table
melt(dat, id = 1L, measure = patterns("^abx", "^start", "^stop"), 
          value.name = c("abx", "start", "stop"))

#     unique.id variable   abx      start       stop
#  1:         1        1  Moxi 2014-01-01 2014-01-07
#  2:         2        1  Moxi 2014-01-01 2014-01-02
#  3:         3        1 Cipro 2014-01-01 2014-01-05
#  4:         4        1 Vanco 2014-01-02 2014-01-03
#  5:         5        1 Vanco 2014-01-01 2014-01-02
#  6:         1        2  PenG 2014-01-01 2014-01-07
#  7:         2        2 Cipro 2014-01-01 2014-01-02
#  8:         3        2 Vanco 2014-01-01 2014-01-05
#  9:         4        2 Cipro 2014-01-02 2014-01-03
# 10:         5        2  PenG 2014-01-01 2014-01-02
# 11:         1        3 Vanco 2014-01-01 2014-01-07
# 12:         2        3  PenG 2014-01-01 2014-01-02
# 13:         3        3 Cipro 2014-01-01 2014-01-05
# 14:         4        3 Cipro 2014-01-02 2014-01-03
# 15:         5        3  PenG 2014-01-01 2014-01-02
# 16:         1        4  Moxi 2014-01-01 2014-01-07
# 17:         2        4 Vanco 2014-01-01 2014-01-02
# 18:         3        4 Vanco 2014-01-01 2014-01-05
# 19:         4        4  PenG 2014-01-02 2014-01-03
# 20:         5        4 Cipro 2014-01-01 2014-01-02

I've used column numbers here, but you can provide column names as well.

Combine Multiple Columns Into Tidy Data

3 Answers3

Linked

Related