Long to wide on multiple columns by data.table

Question

I have a question about data.table's melt and dcast for multiple columns. I browsed on StackOverFlow but many similar posts are not what I want. I will explain below.

First, data is about causes of problems and the value amount. Here is my part of data:

ID   Type    Problem1    Value1     Problem2    Value2    Problem3    Value3
1    A       X           500        Y           1000      Z           400
2    A       X           600        Z           700       
3    B       Y           700        Z           100
4    B       W           200        V           200
5    C       Z           500        V           500       
6    C       X           1000       W           100       V           900

Second, ID is unique. Type contains three(A, B, and C). There are 5 problems.

Take ID == 1 as an example. It is Type A and it contains 3 problems(X, Y, and Z). Its Problem X has Value 500, Problem Y has Value 1000, Problem Z has Value 400. Take ID == 5 as an example. It is Type C and contains 2 problems(Z and V). Its Problem Z has Value 500 and Problem V has Value 500.

Third, column ID, Type, Problem1, Problem2, and Problem3 are character. Value1, Value2, and Value3 are numeric.

The result what I want is:

Type    X     Y     Z     W     V
A       1100  1000  1100  0     0   
B       0     700   100   200   200
C       1000  0     500   100   1400

I don't know how to explain here properly. I want to group the Type and then summation each problem's vlaue. I think this is about long to wide. I found reference here and here. Second one may be useful. However, I have no idea where to begin. Any suggestions?

# data
dt <- fread("
ID   Type    Problem1    Value1     Problem2    Value2    Problem3    Value3
1    A       X           500        Y           1000      Z           400
2    A       X           600        Z           700       
3    B       Y           700        Z           100
4    B       W           200        V           200
5    C       Z           500        V           500       
6    C       X           1000       W           100       V           900", fill = T)

akrun · Accepted Answer · 2017-08-18T07:12:18.093

We can first melt by specifying the patterns in measure to 'long' format and then do the dcast with the fun.aggregate as sum

dcast(melt(dt, measure = patterns("^Value", "^Problem"), 
    value.name = c("Value", "Problem"))[Problem != ""
     ][, Problem := factor(Problem, levels = c("X", "Y", "Z", "W", "V"))], 
     Type ~Problem, value.var = "Value", sum, na.rm = TRUE)
#   Type    X    Y    Z   W    V
#1:    A 1100 1000 1100   0    0
#2:    B    0  700  100 200  200
#3:    C 1000    0  500 100 1400

melt from data.table can take multiple patterns in the measure argument. So, when we say "^Value" it matches all the columns that have names that start (^) with "Value" and similarly for "Problem" and create two 'value' columns. In the above, we are naming those columns as 'Value' and 'Problem' with value.name argument. As the dataset having some blanks, the long format also have the blank elements which we remove with Problem != "". The next step is only important if we need to have the columns in a specific order. So, we change the 'Problem' to factor class and specified the levels in that order. Now, the melt part is completed. The long format is now changed to 'wide' with dcast by specifying the formula, value.var column and the fun.aggregate (here it is sum)

Great answer. However, can u explain this in more detail way. `data.table` is awesome but it is not straightforward to see complex syntax like above answer. Thank you! — Peter Chen, Aug 18 '17 at 04:05
@akrun great answer! And good usage of function passing to dcast. — pogibas, Aug 18 '17 at 04:14

pogibas · Answer 2 · 2017-08-18T04:16:10.440

3

Dummy and straightforward way, but still works (hope that someone can help to improve my solution).

library(magrittr)
rbind(
    dt[, .(Type, P = Problem1, V = Value1)],
    dt[, .(Type, P = Problem2, V = Value2)],
    dt[, .(Type, P = Problem3, V = Value3)]) %>%
    .[P != ""] %>%
    dcast(Type ~ P, value.var = "V", sum)

Edit Improved by following akrun's code (passing function to dcast).

edited Aug 18 '17 at 04:16

answered Aug 18 '17 at 03:59

pogibas

24,254
17
63
100

2

I'd follow the same logic for the data.table answer and use gather and spread if you're in the dplyr/tidyr/pipe world. – thelatemail Aug 18 '17 at 04:05
`dplyr/tidyr/pipe ` is great. However, `data.table` is faster on large data. – Peter Chen Aug 18 '17 at 04:28

Josh Gilfillan · Answer 3 · 2017-08-18T04:34:06.800

This can be done quite easily with dplyr / tidyr:

library("dplyr")
library("tidyr")

# assume x is your dataframe
bind_rows(
  select(x, ID, Type, Problem = Problem1, Value = Value1),
  select(x, ID, Type, Problem = Problem2, Value = Value2),
  select(x, ID, Type, Problem = Problem3, Value = Value3)
  ) %>%
filter(!(is.na(Problem))) %>%
group_by(Type, Problem) %>%
summarise(Value = sum(Value)) %>%
spread(Problem, Value, fill = 0)

Output

# A tibble: 3 x 6
# Groups:   Type [3]
   Type     V     W     X     Y     Z
* <chr> <dbl> <dbl> <dbl> <dbl> <dbl>
1     A     0     0  1100  1000  1100
2     B   200   200     0   700   100
3     C  1400   100  1000     0   500

If the order of columns V - Z is important this can be easily fixed by adding a final select statement.

There are also methods to do a multiple gather in dpylr/tidyr - https://stackoverflow.com/questions/25925556/gather-multiple-sets-of-columns — thelatemail, Aug 18 '17 at 04:14

score 3 · Answer 4 · answered Aug 18 '17 at 13:02

Here is a method that uses the melt function that akrun performs and then uses matrix subsetting to return the desired result.

# melt and aggregate the data
temp <- melt(dt, measure = patterns("^Value", "^Problem"),
             value.name = c("Value", "Problem"))[
        !is.na(Value), .(Value=sum(Value)), by=.(Type, Problem)]

# set up the storage matrix
dimNames <- list(sort(unique(temp$Type)), unique(temp$Problem))
myMat <- matrix(0, length(dimNames[[1]]), length(dimNames[[2]]), dimnames=dimNames)

# fill in the matrix with the desired values
myMat[cbind(temp$Type, temp$Problem)] <- temp$Value

This returns the matrix

myMat
     X    Y   W    Z    V
A 1100 1000   0 1100    0
B    0  700 200  100  200
C 1000    0 100  500 1400

To return a data.table, you can do

data.table(myMat, keep.rownames=TRUE)
   rn    X    Y   W    Z    V
1:  A 1100 1000   0 1100    0
2:  B    0  700 200  100  200
3:  C 1000    0 100  500 1400

Long to wide on multiple columns by data.table

4 Answers4

Linked