Averaging row and column cells from multiple data frames

Question

I have multiple data frames, like:

DG = data.frame(y=c(1,3), v=3:8, x=c(4,6))
DF = data.frame(y=c(1,3), v=3:8, x=c(12,14))
DT = data.frame(y=c(1,3), v=3:8, x=c(4,5))

head(DG)
  y v x
1 1 3 4
2 3 4 6
3 1 5 4
4 3 6 6
5 1 7 4
6 3 8 6

head(DT)
  y v x
1 1 3 4
2 3 4 5
3 1 5 4
4 3 6 5
5 1 7 4
6 3 8 5

head(DF)
  y v  x
1 1 3 12
2 3 4 14
3 1 5 12
4 3 6 14
5 1 7 12
6 3 8 14

I want to calculate the mean of each cell (same row, same column) across all data frames, i.e. the resulting data frame I need looks like:

          y                            v                         x
    1 'mean(DG(y1),DT(y1),DF(y1))' 'mean(DG(v1),DT(v1),DF(v1))' 'mean(DG(x1),DT(x1),DF(x1))'
    2 'mean(DG(y2),DT(y2),DF(y2))' 'mean(DG(v2),DT(v2),DF(v2))' 'mean(DG(x2),DT(x2),DF(x2))'
    3 'mean(DG(y3),DT(y3),DF(y3))' 'mean(DG(v3),DT(v3),DF(v3))' 'mean(DG(x3),DT(x3),DF(x3))'
    ....

In reality, y, v and x are different locations and rows 1 to 6 are time steps. I want to average my data for each time step and location. Eventually, I need one data set that looks like one of the example data sets, but with averaged values in each cell.

I have a working example with loops, but for large datasets it is very slow, so I tried various combinations with apply and rowSums, but neither worked out.

score 1 · Answer 1 · answered Jan 18 '19 at 18:06

If I understand correctly, there are many data frames which all have the same structure (number, name and type of columns) as well as the same number of rows (time steps). Some data points may contain NA.

The code below creates a large data.table from the single data frames and computes the mean values for each time step and location across the different data frames:

library(data.table)
rbindlist(list(DG, DF, DT), idcol = TRUE)[
  , lapply(.SD, mean, na.rm = TRUE), by = .(time_step = rowid(.id))]

   time_step y v        x
1:         1 1 3 6.666667
2:         2 3 4 8.333333
3:         3 1 5 6.666667
4:         4 3 6 8.333333
5:         5 1 7 6.666667
6:         6 3 8 8.333333

This also works with NAs, e.g.,

DG = data.frame(y=c(1,3), v=3:8, x=c(4,6))
DF = data.frame(y=c(1,3), v=3:8, x=c(12,14))
DT = data.frame(y=c(1,3), v=3:8, x=c(4,5,NA))

Note that column x of DT has been modified to contain NA values.

rbindlist(list(DG, DF, DT), idcol = TRUE)[
  , lapply(.SD, mean, na.rm = TRUE), by = .(time_step = rowid(.id))]

   time_step y v         x
1:         1 1 3  6.666667
2:         2 3 4  8.333333
3:         3 1 5  8.000000
4:         4 3 6  8.000000
5:         5 1 7  7.000000
6:         6 3 8 10.000000

Note that the x means in rows 3 to 6 have changed: DT$x is now NA in rows 3 and 6, and its recycled values differ in rows 4 and 5.

Nice, didn't know you could use functions/make new columns in `by` (but of course you can because it's `data.table`). — Gregor Thomas, Jan 18 '19 at 19:42

Gregor Thomas · Accepted Answer · 2019-01-18T19:43:11.720

If you only have the three data frames, I would recommend

result = (DG + DT + DF) / 3
result
#   y v        x
# 1 1 3 6.666667
# 2 3 4 8.333333
# 3 1 5 6.666667
# 4 3 6 8.333333
# 5 1 7 6.666667
# 6 3 8 8.333333

This assumes that your rows and columns are already in the correct order.
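The assumption can be checked before adding anything. The following is only a sketch: `frames`, `ref`, `same_layout`, `DF_shuffled` and `DF_fixed` are hypothetical names introduced here, and the shuffled column order is made up to illustrate the fix.

```r
# Example data from the question
DG = data.frame(y = c(1, 3), v = 3:8, x = c(4, 6))
DF = data.frame(y = c(1, 3), v = 3:8, x = c(12, 14))
DT = data.frame(y = c(1, 3), v = 3:8, x = c(4, 5))

frames = list(DG = DG, DF = DF, DT = DT)
ref = frames[[1]]

# TRUE for each frame that has the same dimensions and column names as the first
same_layout = vapply(
  frames,
  function(d) identical(dim(d), dim(ref)) && identical(names(d), names(ref)),
  logical(1)
)
stopifnot(all(same_layout))

# If only the column order differs, reorder by name before adding
DF_shuffled = DF[c("x", "y", "v")]   # illustrative frame with shuffled columns
DF_fixed = DF_shuffled[names(DG)]    # columns back in the order of DG
names(DF_fixed)
```

This only verifies shape and column names. Whether row i means the same time step in every frame still has to be known from how the data were produced.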

If you have more data frames, put them in a list (see here for help with that) and then you can do this:

result = Reduce("+", list_of_data) / length(list_of_data)
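As a minimal sketch of the list approach, assuming the three example data frames are collected into `list_of_data` by hand (with many frames the list would more likely be built programmatically). `DT_na` and `result_na` are hypothetical names used only to show the NA caveat.

```r
# Example data from the question
DG = data.frame(y = c(1, 3), v = 3:8, x = c(4, 6))
DF = data.frame(y = c(1, 3), v = 3:8, x = c(12, 14))
DT = data.frame(y = c(1, 3), v = 3:8, x = c(4, 5))

# Collect the data frames in a list; any number of frames works
list_of_data = list(DG, DF, DT)

# Element-wise sum of all frames, divided by the number of frames
result = Reduce("+", list_of_data) / length(list_of_data)
result

# Caveat: an NA in any frame makes that cell NA in the result
DT_na = DT
DT_na$x[3] = NA
result_na = Reduce("+", list(DG, DF, DT_na)) / 3
is.na(result_na$x[3])
```

Because the sum propagates NA, this approach fits only complete data.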

If you need advanced features of mean, like ignoring NAs or trimming, this won't work. Instead, I would recommend converting your data frames to matrices, stacking them into a 3-d array, and applying mean.

library(abind)
stack = abind(DG, DF, DT, along = 3)
# if you have data frames in a list, do this instead:
# stack = do.call(abind, c(list_of_data, along = 3))

apply(stack, MARGIN = 1:2, FUN = mean, na.rm = TRUE)
#      y v        x
# [1,] 1 3 6.666667
# [2,] 3 4 8.333333
# [3,] 1 5 6.666667
# [4,] 3 6 8.333333
# [5,] 1 7 6.666667
# [6,] 3 8 8.333333
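If installing abind is not an option, the same stacking idea can be sketched in base R. This assumes every data frame is numeric and has identical dimensions. `mats` is a hypothetical name, and the NA values in DT are added here only to show that they are ignored.

```r
DG = data.frame(y = c(1, 3), v = 3:8, x = c(4, 6))
DF = data.frame(y = c(1, 3), v = 3:8, x = c(12, 14))
DT = data.frame(y = c(1, 3), v = 3:8, x = c(4, 5, NA))  # NA in rows 3 and 6
list_of_data = list(DG, DF, DT)

# Convert each data frame to a numeric matrix
mats = lapply(list_of_data, as.matrix)

# Stack the matrices along a third dimension: rows x columns x frames
stack = array(
  unlist(mats),
  dim = c(dim(mats[[1]]), length(mats)),
  dimnames = c(dimnames(mats[[1]]), list(NULL))
)

# Average over the third dimension, ignoring NAs
result = rowMeans(stack, dims = 2, na.rm = TRUE)
result
```

With `dims = 2`, `rowMeans` treats the first two dimensions as "rows" and averages over the rest, so it should be a quicker alternative to `apply(stack, 1:2, mean, na.rm = TRUE)` on large arrays. It cannot trim, so `apply` is still needed for the other options of mean.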

The final method I'll recommend is a "tidy" method - combine your data into one data frame and use grouped operations to produce the result. This can be done easily with data.table or dplyr. See Uwe's answer for a nice data.table implementation.

library(dplyr)
bind_rows(list(DG, DF, DT), .id = ".id") %>%
  group_by(.id) %>%
  mutate(rn = row_number()) %>%
  ungroup() %>%
  select(-.id) %>%
  group_by(rn) %>%
  summarize_all(mean, na.rm = TRUE) %>%
  select(-rn)
# # A tibble: 6 x 3
#       y     v     x
#   <dbl> <dbl> <dbl>
# 1     1     3  6.67
# 2     3     4  8.33
# 3     1     5  6.67
# 4     3     6  8.33
# 5     1     7  6.67
# 6     3     8  8.33

I have more than three data frames and need to ignore NAs, so both of your solutions won't work for me. I did not point that out before. I tried some of the solutions from https://stackoverflow.com/questions/31465415/combine-multiple-data-frames-and-calculate-average (I think 3-d arrays are considered there as well), but it did not work. How can I stack data into an array? — Sophia Dode, Jan 18 '19 at 16:45
