
I have multiple data frames, like:

DG = data.frame(y=c(1,3), v=3:8, x=c(4,6))
DF = data.frame(y=c(1,3), v=3:8, x=c(12,14))
DT = data.frame(y=c(1,3), v=3:8, x=c(4,5))

head(DG)
  y v x
1 1 3 4
2 3 4 6
3 1 5 4
4 3 6 6
5 1 7 4
6 3 8 6

head(DT)
  y v x
1 1 3 4
2 3 4 5
3 1 5 4
4 3 6 5
5 1 7 4
6 3 8 5

head(DF)
  y v  x
1 1 3 12
2 3 4 14
3 1 5 12
4 3 6 14
5 1 7 12
6 3 8 14

I want to calculate the mean of each cell (same row and same column) across all data frames, i.e. the resulting data frame I need looks like:

          y                            v                            x
    1 'mean(DG(y1),DT(y1),DF(y1))' 'mean(DG(v1),DT(v1),DF(v1))' 'mean(DG(x1),DT(x1),DF(x1))'
    2 'mean(DG(y2),DT(y2),DF(y2))' 'mean(DG(v2),DT(v2),DF(v2))' 'mean(DG(x2),DT(x2),DF(x2))'
    3 'mean(DG(y3),DT(y3),DF(y3))' 'mean(DG(v3),DT(v3),DF(v3))' 'mean(DG(x3),DT(x3),DF(x3))'
    ....

In reality, y, v and x are different locations and 1 - 6 time steps. I want to average my data for each time step and location. Eventually, I need one data set, that looks like one of the example data sets, but with averaged values in each cell.

I have a working example with loops, but for large datasets it is very slow, so I tried various combinations with apply and rowSums, but neither worked out.


2 Answers


If I understand correctly, there are many data frames which all have the same structure (number, name and type of columns) as well as the same number of rows (time steps). Some data points may contain NA.

The code below creates a large data.table from the single data frames and computes the mean values for each time step and location across the different data frames:

library(data.table)
rbindlist(list(DG, DF, DT), idcol = TRUE)[
  , lapply(.SD, mean, na.rm = TRUE), by = .(time_step = rowid(.id))]
   time_step y v        x
1:         1 1 3 6.666667
2:         2 3 4 8.333333
3:         3 1 5 6.666667
4:         4 3 6 8.333333
5:         5 1 7 6.666667
6:         6 3 8 8.333333

This will work also with NAs, e.g.,

DG = data.frame(y=c(1,3), v=3:8, x=c(4,6))
DF = data.frame(y=c(1,3), v=3:8, x=c(12,14))
DT = data.frame(y=c(1,3), v=3:8, x=c(4,5,NA))

Note that column x of DT has been modified to contain NAs:

rbindlist(list(DG, DF, DT), idcol = TRUE)[
  , lapply(.SD, mean, na.rm = TRUE), by = .(time_step = rowid(.id))]

   time_step y v         x
1:         1 1 3  6.666667
2:         2 3 4  8.333333
3:         3 1 5  8.000000
4:         4 3 6  8.000000
5:         5 1 7  7.000000
6:         6 3 8 10.000000

Note that x has changed in rows 3 to 6. In rows 3 and 6 the NA in DT is ignored, so the mean is taken over the two remaining values.
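
If data.table is not available, the same "stack, then group by row position" idea can be sketched in base R. This is only an illustration and not part of the approach above: list_of_data is a hypothetical name for the collection of data frames, and aggregate() is swapped in for the data.table grouping. It assumes all data frames have the same columns and the same row order.

```r
DG = data.frame(y = c(1, 3), v = 3:8, x = c(4, 6))
DF = data.frame(y = c(1, 3), v = 3:8, x = c(12, 14))
DT = data.frame(y = c(1, 3), v = 3:8, x = c(4, 5, NA))

# Hypothetical name for the collection of data frames
list_of_data = list(DG, DF, DT)

# Stack all frames and record each row's position within its own frame
stacked = do.call(rbind, list_of_data)
time_step = sequence(vapply(list_of_data, nrow, integer(1)))

# Use the data.frame method of aggregate() rather than the formula
# interface: the formula interface drops rows containing NA by default,
# whereas here na.rm = TRUE is passed on to mean() for each column
result = aggregate(stacked, by = list(time_step = time_step),
                   FUN = mean, na.rm = TRUE)
result
```

The result has one row per time step, with a time_step column followed by the averaged y, v and x columns.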

Uwe

If you only have the three data frames, I would recommend

result = (DG + DT + DF) / 3
result
#   y v        x
# 1 1 3 6.666667
# 2 3 4 8.333333
# 3 1 5 6.666667
# 4 3 6 8.333333
# 5 1 7 6.666667
# 6 3 8 8.333333

This assumes that your rows and columns are already in the correct order.

If you have more data frames, put them in a list and then you can do this:

result = Reduce("+", list_of_data) / length(list_of_data)
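
As a small worked sketch of the line above, using the three example data frames from the question (list_of_data is just an illustrative name), including a check of what happens when one value is missing:

```r
DG = data.frame(y = c(1, 3), v = 3:8, x = c(4, 6))
DF = data.frame(y = c(1, 3), v = 3:8, x = c(12, 14))
DT = data.frame(y = c(1, 3), v = 3:8, x = c(4, 5))

list_of_data = list(DG, DF, DT)

# Element-wise sum of all data frames, divided by how many there are
result = Reduce("+", list_of_data) / length(list_of_data)
result

# A single NA in any input makes the corresponding cell of the result NA,
# because "+" cannot skip missing values
DT_na = DT
DT_na$x[3] = NA
with_na = Reduce("+", list(DG, DF, DT_na)) / 3
is.na(with_na$x[3])  # TRUE
```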

If you need advanced features of mean, like ignoring NAs or trimming, this won't work. Instead, I would recommend converting your data frames to matrices, stacking them into a 3-d array, and applying mean.

library(abind)
stack = abind(DG, DF, DT, along = 3)
# if you have data frames in a list, do this instead:
# stack = do.call(abind, c(list_of_data, along = 3))

apply(stack, MARGIN = 1:2, FUN = mean, na.rm = TRUE)
#      y v        x
# [1,] 1 3 6.666667
# [2,] 3 4 8.333333
# [3,] 1 5 6.666667
# [4,] 3 6 8.333333
# [5,] 1 7 6.666667
# [6,] 3 8 8.333333
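
If you would rather not depend on abind, the stacking can also be sketched in base R. The version below is an alternative illustration, not the method shown above: it builds the 3-d array with array() and uses rowMeans() with dims = 2, which averages over the third dimension and may be faster than apply() on large inputs. list_of_data is an illustrative name, and all data frames are assumed to be numeric with identical dimensions and column order.

```r
DG = data.frame(y = c(1, 3), v = 3:8, x = c(4, 6))
DF = data.frame(y = c(1, 3), v = 3:8, x = c(12, 14))
DT = data.frame(y = c(1, 3), v = 3:8, x = c(4, 5, NA))

list_of_data = list(DG, DF, DT)

# Convert each data frame to a numeric matrix
mats = lapply(list_of_data, as.matrix)

# Stack the matrices along a third dimension: rows x columns x data frames
stack = array(unlist(mats),
              dim = c(dim(mats[[1]]), length(mats)),
              dimnames = list(NULL, colnames(mats[[1]]), NULL))

# dims = 2 treats the first two dimensions as "rows", so the mean is
# taken over the third dimension (across data frames)
means = rowMeans(stack, na.rm = TRUE, dims = 2)

# Back to a data frame with the original column names
result = as.data.frame(means)
result
```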

The final method I'll recommend is a "tidy" method - combine your data into one data frame and use grouped operations to produce the result. This can be done easily with data.table or dplyr. See Uwe's answer for a nice data.table implementation.

library(dplyr)
bind_rows(list(DG, DF, DT), .id = ".id") %>%
  group_by(.id) %>%
  mutate(rn = row_number()) %>%
  ungroup() %>%
  select(-.id) %>%
  group_by(rn) %>%
  summarize_all(mean, na.rm = TRUE) %>%
  select(-rn)
# # A tibble: 6 x 3
#       y     v     x
#   <dbl> <dbl> <dbl>
# 1     1     3  6.67
# 2     3     4  8.33
# 3     1     5  6.67
# 4     3     6  8.33
# 5     1     7  6.67
# 6     3     8  8.33
Gregor Thomas
  • I have more than three data frames and need to ignore NAs, so both of your solutions won't work for me. I did not point that out before. I tried some of the solutions from https://stackoverflow.com/questions/31465415/combine-multiple-data-frames-and-calculate-average (I think 3-d arrays are considered there as well), but they did not work. How can I stack data into an array? – Sophia Dode Jan 18 '19 at 16:45
  • Added example for the array method. – Gregor Thomas Jan 18 '19 at 17:26
  • Both solutions work perfectly fine, thank you very much! – Sophia Dode Jan 20 '19 at 17:01