Complex melting with dplyr, tidyr, reshape2 in R

Question

I am trying to understand how "melting" works in complex situations. I've seen plenty of posts and blogs about using these packages in very simple cases - but not in more difficult ones. For example:

Lets say I have the following data in a dataframe:

  CA UNIT      SCP    DATE1    TIME1   DESC1 ENTRIES1  EXITS1  
1 A002 R051 02-00-00 07-27-13 00:00:00 REGULAR  4209603 1443585
2 A002 R051 02-00-00 07-28-13 08:00:00 REGULAR  4210490 1443821
3 A002 R051 02-00-00 07-29-13 16:00:00 REGULAR  4211586 1444302
4 A002 R051 02-00-00 07-30-13 14:01:46   LOGON  4213192 1444700
5 A002 R051 02-00-00 07-30-13 16:00:00 REGULAR  4213333 1444737
6 A002 R051 02-00-00 08-01-13 00:00:00 REGULAR  4215894 1445274`

and continued columns to the right (sorry, I couldn't format it properly in the code block):

     `DATE2    TIME2     DESC2 ENTRIES2  EXITS2
1   07-27-13 08:00:00   REGULAR  4209663 1443616 
2   07-28-13 16:00:00   REGULAR  4210775 1443921 
3   07-30-13 00:00:00   REGULAR  4212845 1444369 
4   07-30-13 14:02:18 DOOR OPEN  4213192 1444700 
5   07-31-13 00:00:00   REGULAR  4214345 1444823 
6   08-01-13 08:00:00   REGULAR  4215977 1445362`

and I want to melt this into a dataframe with the following format:

  CA UNIT      SCP    DATE    TIME        DESC    ENTRIES  EXITS  
1 A002 R051 02-00-00 07-27-13 00:00:00   REGULAR  4209603 1443585
2 A002 R051 02-00-00 07-28-13 08:00:00   REGULAR  4210490 1443821
3 A002 R051 02-00-00 07-29-13 16:00:00   REGULAR  4211586 1444302
4 A002 R051 02-00-00 07-30-13 14:01:46     LOGON  4213192 1444700
5 A002 R051 02-00-00 07-30-13 16:00:00   REGULAR  4213333 1444737
6 A002 R051 02-00-00 08-01-13 00:00:00   REGULAR  4215894 1445274
7 A002 R051 02-00-00 07-27-13 08:00:00   REGULAR  4209663 1443616
8 A002 R051 02-00-00 07-28-13 16:00:00   REGULAR  4210775 1443921 
9 A002 R051 02-00-00 07-30-13 00:00:00   REGULAR  4212845 1444369
10A002 R051 02-00-00 07-30-13 14:02:18 DOOR OPEN  4213192 1444700
11A002 R051 02-00-00 07-31-13 00:00:00   REGULAR  4214345 1444823
12A002 R051 02-00-00 08-01-13 08:00:00   REGULAR  4215977 1445362

The challenge here is that the columns I want to "melt" have different data types. All the posts I read are very straightforward and assume all the melted columns are of the same data type and will fall into nice key/value pairs. That is clearly not the case here.

I have found another post that indicates this restructuring can be done using the 'replace' from stats. I get that. But if dplyr, reshape2, and tidyr can't be used for more complicated real world scenarios what is the real use?

Please show how to do this with tidyr, dplyr or reshape2.

Thank you in advance!

score 6 · Answer 1 · answered Aug 14 '16 at 22:18

Here is an option with reshape from base R, if you look at the ?reshape the default separator is . and if you specify the sep as "", then it will use regular expression to split the column names which is [A-Za-z][0-9] which suits your case pretty well here. You can drop the time and id variable if you don't need them:

reshape(df, varying = 4:13, dir = "long", sep = "")

      CA UNIT      SCP time     DATE     TIME     DESC ENTRIES   EXITS id
 1: A002 R051 02-00-00    1 07-27-13 00:00:00  REGULAR 4209603 1443585  1
 2: A002 R051 02-00-00    1 07-28-13 08:00:00  REGULAR 4210490 1443821  2
 3: A002 R051 02-00-00    1 07-29-13 16:00:00  REGULAR 4211586 1444302  3
 4: A002 R051 02-00-00    1 07-30-13 14:01:46    LOGON 4213192 1444700  4
 5: A002 R051 02-00-00    1 07-30-13 16:00:00  REGULAR 4213333 1444737  5
 6: A002 R051 02-00-00    1 08-01-13 00:00:00  REGULAR 4215894 1445274  6
 7: A002 R051 02-00-00    2 07-27-13 08:00:00  REGULAR 4209663 1443616  1
 8: A002 R051 02-00-00    2 07-28-13 16:00:00  REGULAR 4210775 1443921  2
 9: A002 R051 02-00-00    2 07-30-13 00:00:00  REGULAR 4212845 1444369  3
10: A002 R051 02-00-00    2 07-30-13 14:02:18 DOOROPEN 4213192 1444700  4
11: A002 R051 02-00-00    2 07-31-13 00:00:00  REGULAR 4214345 1444823  5
12: A002 R051 02-00-00    2 08-01-13 08:00:00  REGULAR 4215977 1445362  6

Thank you for this. I think it is the best way to do this although my question was oriented toward better understanding complex manipulations with dplyr, tidyr, and reshape2. question: using the "4:13" syntax for "varying", how does the function know to properly stack columns? Does it just know we're dealing with an equal number and stack accordingly? — Windstorm1981, Aug 16 '16 at 01:34
The way `reshape` works is that it will first split your columns names either by separator or regular expression if a separator does not exist as is your case. And columns with left hand side of the separated names matched will be stacked in one column while the right hand side number will be the time variable. — Psidom, Aug 16 '16 at 01:42
And the varying `4:13` specifies the column numbers which needs to be stacked so that only columns from 4 to 13 will be divided by the group number at the end of column names and stacked according to the first half of the column name. — Psidom, Aug 16 '16 at 01:54

eipi10 · Answer 2 · 2016-08-14T23:05:38.117

There's probably a nice tidyr and/or data.table solution, but you can also do the following in base R to stack the two five-column groups:

names(dat) = gsub("1|2", "", names(dat))
rbind(dat[,1:8], dat[,c(1:3,9:13)])

     CA UNIT      SCP     DATE     TIME      DESC ENTRIES   EXITS
1  A002 R051 02-00-00 07-27-13 00:00:00   REGULAR 4209603 1443585
2  A002 R051 02-00-00 07-28-13 08:00:00   REGULAR 4210490 1443821
3  A002 R051 02-00-00 07-29-13 16:00:00   REGULAR 4211586 1444302
4  A002 R051 02-00-00 07-30-13 14:01:46     LOGON 4213192 1444700
5  A002 R051 02-00-00 07-30-13 16:00:00   REGULAR 4213333 1444737
6  A002 R051 02-00-00 08-01-13 00:00:00   REGULAR 4215894 1445274
7  A002 R051 02-00-00 07-27-13 08:00:00   REGULAR 4209663 1443616
8  A002 R051 02-00-00 07-28-13 16:00:00   REGULAR 4210775 1443921
9  A002 R051 02-00-00 07-30-13 00:00:00   REGULAR 4212845 1444369
10 A002 R051 02-00-00 07-30-13 14:02:18 DOOR_OPEN 4213192 1444700
11 A002 R051 02-00-00 07-31-13 00:00:00   REGULAR 4214345 1444823
12 A002 R051 02-00-00 08-01-13 08:00:00   REGULAR 4215977 1445362

In your example, you only have two five-column sets that need to be stacked. If you have more such groups, you could do something like this to avoid lots of hard-coding:

dat = lapply(seq(4,ncol(dat),5), function(i) {
  tmp = dat[, c(1:3, i:(i+4))]
  names(tmp) = gsub("[0-9]", "", names(tmp))
  tmp
})

dat = do.call(rbind, dat)

Well, to paraphrase David St. Hubbins in *This Is Spinal Tap*, "There's a fine line between clever and stupid." — eipi10, Aug 18 '16 at 00:47

score 3 · Answer 3 · answered Aug 15 '16 at 02:29

We can do this easily with the melt from data.table as it can take multiple measure patterns.

library(data.table)
melt(setDT(df), measure = patterns(c("DATE", "TIME", "DESC", "ENTRIES", 
       "EXITS")), value.name = c('DATE', 'TIME', 'DESC', 'ENTRIES', 
       'EXITS'))[, variable := NULL][]
#       CA UNIT      SCP     DATE     TIME      DESC ENTRIES   EXITS
#1: A002 R051 02-00-00 07-27-13 00:00:00   REGULAR 4209603 1443585
#2: A002 R051 02-00-00 07-28-13 08:00:00   REGULAR 4210490 1443821
#3: A002 R051 02-00-00 07-29-13 16:00:00   REGULAR 4211586 1444302
#4: A002 R051 02-00-00 07-30-13 14:01:46     LOGON 4213192 1444700
#5: A002 R051 02-00-00 07-30-13 16:00:00   REGULAR 4213333 1444737
#6: A002 R051 02-00-00 08-01-13 00:00:00   REGULAR 4215894 1445274
#7: A002 R051 02-00-00 07-27-13 08:00:00   REGULAR 4209663 1443616
#8: A002 R051 02-00-00 07-28-13 16:00:00   REGULAR 4210775 1443921
#9: A002 R051 02-00-00 07-30-13 00:00:00   REGULAR 4212845 1444369
#10:A002 R051 02-00-00 07-30-13 14:02:18 DOOR OPEN 4213192 1444700
#11:A002 R051 02-00-00 07-31-13 00:00:00   REGULAR 4214345 1444823
#12:A002 R051 02-00-00 08-01-13 08:00:00   REGULAR 4215977 1445362

Thanks. I am Not as familiar with the data.table package as I would like to be. I need to review your answer in detail to make sure I understand it. — Windstorm1981, Aug 16 '16 at 01:38

score 1 · Answer 4 · answered Aug 14 '16 at 23:10

In tidyr, you need to gather to long, fix the column names, and then spread back to wide. Since spread is finicky about indices, you'll need to add a unique ID column, which dplyr::add_rownames can do nicely:

library(tidyr)
library(dplyr)

df %>% add_rownames() %>% 
    # gather to long form
    gather(var, val, DATE1:EXITS2) %>% 
    # separate ID suffix from variable name by position
    separate(var, c('var', 'id'), sep = -2) %>% 
    # spread back to wide form
    spread(var, val, convert = TRUE) %>% 
    # clean up extra columns
    select(-rowname, -id)

## # A tibble: 12 x 8
##        CA   UNIT      SCP     DATE      DESC ENTRIES   EXITS     TIME
## *  <fctr> <fctr>   <fctr>    <chr>     <chr>   <int>   <int>    <chr>
## 1    A002   R051 02-00-00 07-27-13   REGULAR 4209603 1443585 00:00:00
## 2    A002   R051 02-00-00 07-27-13   REGULAR 4209663 1443616 08:00:00
## 3    A002   R051 02-00-00 07-28-13   REGULAR 4210490 1443821 08:00:00
## 4    A002   R051 02-00-00 07-28-13   REGULAR 4210775 1443921 16:00:00
## 5    A002   R051 02-00-00 07-29-13   REGULAR 4211586 1444302 16:00:00
## 6    A002   R051 02-00-00 07-30-13   REGULAR 4212845 1444369 00:00:00
## 7    A002   R051 02-00-00 07-30-13     LOGON 4213192 1444700 14:01:46
## 8    A002   R051 02-00-00 07-30-13 DOOR OPEN 4213192 1444700 14:02:18
## 9    A002   R051 02-00-00 07-30-13   REGULAR 4213333 1444737 16:00:00
## 10   A002   R051 02-00-00 07-31-13   REGULAR 4214345 1444823 00:00:00
## 11   A002   R051 02-00-00 08-01-13   REGULAR 4215894 1445274 00:00:00
## 12   A002   R051 02-00-00 08-01-13   REGULAR 4215977 1445362 08:00:00

Thanks. I need to review your answer in detail to make sure I completely understand. I may have a question or two. — Windstorm1981, Aug 16 '16 at 01:40

Complex melting with dplyr, tidyr, reshape2 in R

4 Answers4