
Problem

I have 2 data tables,

Movement:

library(data.table)

consEx = data.table(
  begin = as.POSIXct(c("2019-04-01 00:00:10"," 2019-04-07 10:00:00","2019-04-10 23:00:00","2019-04-12 20:00:00","2019-04-15 10:00:00",
                       "2019-04-20 10:00:00","2019-04-22 13:30:00","2019-04-10 15:30:00","2019-04-12 21:30:00","2019-04-15 20:00:00")),

  end = as.POSIXct(c("2019-04-01 20:00:00","2019-04-07 15:00:00","2019-04-11 10:00:00","2019-04-12 23:30:00","2019-04-15 15:00:00",
                     "2019-04-21 12:00:00","2019-04-22 17:30:00","2019-04-10 20:00:00","2019-04-13 05:00:00","2019-04-15 12:30:00")),

  carId = c(1,1,1,2,2,3,3,4,4,5),
  tripId = c(1:10)
)

And Alerts:

alertsEx = data.table(
  timestamp = as.POSIXct(c("2019-04-01 10:00:00","2019-04-01 10:30:00","2019-04-01 15:00:00","2019-04-15 13:00:00","2019-04-22 14:00:00",
                "2019-04-22 15:10:00","2019-04-22 15:40:00","2019-04-10 16:00:00","2019-04-10 17:00:00","2019-04-13 04:00:00")),
  type = c("T1","T2","T1",'T3',"T1","T1","T3","T2","T2","T1"),
  carId = c(1,1,1,2,3,3,3,4,4,4),
  additionalInfo1 = rnorm(10,mean=10,sd=4)
)

The movement table records the period (begin to end) during which a car was moving. The alerts table shows when an alert happened for a car, containing the type, timestamp and carId.

I need to join those two tables, summarising the alerts data by type. When there are multiple alerts of the same type during the same period, I need to aggregate additionalInfo1 by taking its mean.

Current Approach

I'm currently doing this by looping through consEx and applying a function to each row that returns a list with the calculations I need:

findAlerts = function(begin,end,carId_2){
  saida = alertsEx[timestamp >= begin & timestamp <= end & carId == carId_2,]
  totals = nrow(saida)

  saida = split(saida,by="type")

  resultsList = list(
    "totals" = 0,
    "t1Count" = 0,
    "t1Mean" = 0,
    "t2Count" = 0,
    "t2Mean" = 0,
    "t3Count" = 0,
    "t3Mean" = 0)
  resultsList[["totals"]] = totals

  types = names(saida)
  if("T1" %in% types){
    resultsList[["t1Count"]] = nrow(saida[["T1"]])
    resultsList[["t1Mean"]] = mean(saida[["T1"]]$additionalInfo1)
  }

  if("T2" %in% types){
    resultsList[["t2Count"]] = nrow(saida[["T2"]])
    resultsList[["t2Mean"]] = mean(saida[["T2"]]$additionalInfo1)
  }

  if("T3" %in% types){
    resultsList[["t3Count"]] = nrow(saida[["T3"]])
    resultsList[["t3Mean"]] = mean(saida[["T3"]]$additionalInfo1)
  }

  return(resultsList)
}

for(i in 1:nrow(consEx)){
  aux = findAlerts(consEx$begin[i],consEx$end[i],consEx$carId[i])
  consEx[i,names(aux) := aux]

}

Expected Output

This gives the expected output:

                 begin                 end carId tripId totals t1Count    t1Mean t2Count    t2Mean t3Count    t3Mean
 1: 2019-04-01 00:00:10 2019-04-01 20:00:00     1      1      3       2 10.123463       1 14.479288       0  0.000000
 2: 2019-04-07 10:00:00 2019-04-07 15:00:00     1      2      0       0  0.000000       0  0.000000       0  0.000000
 3: 2019-04-10 23:00:00 2019-04-11 10:00:00     1      3      0       0  0.000000       0  0.000000       0  0.000000
 4: 2019-04-12 20:00:00 2019-04-12 23:30:00     2      4      0       0  0.000000       0  0.000000       0  0.000000
 5: 2019-04-15 10:00:00 2019-04-15 15:00:00     2      5      1       0  0.000000       0  0.000000       1  6.598062
 6: 2019-04-20 10:00:00 2019-04-21 12:00:00     3      6      0       0  0.000000       0  0.000000       0  0.000000
 7: 2019-04-22 13:30:00 2019-04-22 17:30:00     3      7      3       2  7.610410       0  0.000000       1 10.218593
 8: 2019-04-10 15:30:00 2019-04-10 20:00:00     4      8      2       0  0.000000       2  9.703278       0  0.000000
 9: 2019-04-12 21:30:00 2019-04-13 05:00:00     4      9      1       1  7.095564       0  0.000000       0  0.000000
10: 2019-04-15 20:00:00 2019-04-15 12:30:00     5     10      0       0  0.000000       0  0.000000       0  0.000000

But this is too slow for my original data, which has 13M rows in consEx and 2M in alerts. There are also several different alert types, for which I need to calculate the mean, minimum and maximum of 2 metrics.

Is there a faster way to write the function or the loop?

Additionally, would writing virtually the same code in Python give a better result? I'm considering rewriting it, but it would take a long time and I'm not sure it would solve the issue.

Thanks

PS: Another approach I tried was looping through alerts and assigning each of them a tripId from consEx and then aggregating the resulting table, but this seems to be even slower.

Fino

5 Answers


This answer suggests two data.table approaches which both use non-equi joins for matching alerts to trips and dcast() for reshaping but differ in the way the aggregates are combined with consEx.

The first variant creates a new result dataset, leaving consEx unchanged.

The second variant modifies consEx in place. This is similar to Alexis' solution but more concise, and it uses the new setnafill() function (available with the development version 1.12.3 of data.table).

Variant 1: Create new result dataset

agg <- consEx[alertsEx, on = .(carId, begin <= timestamp, end >= timestamp), 
              .(tripId, type, additionalInfo1)][
                , .(Count = .N, Mean = mean(additionalInfo1)), 
                by = .(tripId, type)][
                  , totals := sum(Count), by = tripId][] 
result <- dcast(agg[consEx, on = "tripId"], ... ~ type, 
                value.var = c("Count", "Mean"), fill = 0)[
                  , c("Count_NA", "Mean_NA") := NULL][
                    is.na(totals), totals := 0]
setcolorder(result, names(consEx))
result
                  begin                 end carId tripId totals Count_T1 Count_T2 Count_T3   Mean_T1   Mean_T2   Mean_T3
 1: 2019-04-01 00:00:10 2019-04-01 20:00:00     1      1      3        2        1        0 12.654609 12.375862  0.000000
 2: 2019-04-07 10:00:00 2019-04-07 15:00:00     1      2      0        0        0        0  0.000000  0.000000  0.000000
 3: 2019-04-10 23:00:00 2019-04-11 10:00:00     1      3      0        0        0        0  0.000000  0.000000  0.000000
 4: 2019-04-12 20:00:00 2019-04-12 23:30:00     2      4      0        0        0        0  0.000000  0.000000  0.000000
 5: 2019-04-15 10:00:00 2019-04-15 15:00:00     2      5      1        0        0        1  0.000000  0.000000  9.316815
 6: 2019-04-20 10:00:00 2019-04-21 12:00:00     3      6      0        0        0        0  0.000000  0.000000  0.000000
 7: 2019-04-22 13:30:00 2019-04-22 17:30:00     3      7      3        2        0        1  8.586061  0.000000 11.498512
 8: 2019-04-10 15:30:00 2019-04-10 20:00:00     4      8      2        0        2        0  0.000000  8.696356  0.000000
 9: 2019-04-12 21:30:00 2019-04-13 05:00:00     4      9      1        1        0        0  9.343681  0.000000  0.000000
10: 2019-04-15 20:00:00 2019-04-15 12:30:00     5     10      0        0        0        0  0.000000  0.000000  0.000000

Note that the means are likely to differ between the different answers posted, as additionalInfo1 is created by rnorm(10, mean = 10, sd = 4) without a call to set.seed() beforehand. Calling set.seed() creates reproducible random numbers.
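
For example, a minimal sketch (the seed value is arbitrary):

set.seed(123)                 # fix the RNG state
rnorm(3, mean = 10, sd = 4)   # now returns the same three numbers on every run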

Explanation of variant 1

The first part finds the matching trips for each alert by a non-equi join and computes the aggregates by tripId and type as well as the total count for each trip:

agg
   tripId type Count      Mean totals
1:      1   T1     2 12.654609      3
2:      1   T2     1 12.375862      3
3:      5   T3     1  9.316815      1
4:      7   T1     2  8.586061      3
5:      7   T3     1 11.498512      3
6:      8   T2     2  8.696356      2
7:      9   T1     1  9.343681      1

Note that tripId is assumed to be a unique key in consEx.
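
A quick sanity check of that assumption (a sketch using data.table's uniqueN()):

stopifnot(uniqueN(consEx$tripId) == nrow(consEx))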

In the next step, the aggregations are right joined to consEx

agg[consEx, on = "tripId"]
    tripId type Count      Mean totals               begin                 end carId
 1:      1   T1     2 12.654609      3 2019-04-01 00:00:10 2019-04-01 20:00:00     1
 2:      1   T2     1 12.375862      3 2019-04-01 00:00:10 2019-04-01 20:00:00     1
 3:      2 <NA>    NA        NA     NA 2019-04-07 10:00:00 2019-04-07 15:00:00     1
 4:      3 <NA>    NA        NA     NA 2019-04-10 23:00:00 2019-04-11 10:00:00     1
 5:      4 <NA>    NA        NA     NA 2019-04-12 20:00:00 2019-04-12 23:30:00     2
 6:      5   T3     1  9.316815      1 2019-04-15 10:00:00 2019-04-15 15:00:00     2
 7:      6 <NA>    NA        NA     NA 2019-04-20 10:00:00 2019-04-21 12:00:00     3
 8:      7   T1     2  8.586061      3 2019-04-22 13:30:00 2019-04-22 17:30:00     3
 9:      7   T3     1 11.498512      3 2019-04-22 13:30:00 2019-04-22 17:30:00     3
10:      8   T2     2  8.696356      2 2019-04-10 15:30:00 2019-04-10 20:00:00     4
11:      9   T1     1  9.343681      1 2019-04-12 21:30:00 2019-04-13 05:00:00     4
12:     10 <NA>    NA        NA     NA 2019-04-15 20:00:00 2019-04-15 12:30:00     5

The output is immediately reshaped from long to wide format using dcast().

Finally, the result is cleaned up by removing superfluous columns, replacing NA by 0, and changing the column order.

Variant 2: Update in place

agg <- consEx[alertsEx, on = .(carId, begin <= timestamp, end >= timestamp), 
              .(tripId, type, additionalInfo1)][
                , .(Count = .N, Mean = mean(additionalInfo1)), 
                by = .(tripId, type)][
                  , totals := sum(Count), by = tripId][] 
wide <- dcast(agg, ... ~ type, value.var = c("Count", "Mean"), fill = 0)
consEx[wide, on = "tripId", (names(wide)) := mget(paste0("i.", names(wide)))]
setnafill(consEx, fill = 0) # data.table version 1.12.3+ 

consEx
                  begin                 end carId tripId totals Count_T1 Count_T2 Count_T3   Mean_T1   Mean_T2   Mean_T3
 1: 2019-04-01 00:00:10 2019-04-01 20:00:00     1      1      3        2        1        0 12.654609 12.375862  0.000000
 2: 2019-04-07 10:00:00 2019-04-07 15:00:00     1      2      0        0        0        0  0.000000  0.000000  0.000000
 3: 2019-04-10 23:00:00 2019-04-11 10:00:00     1      3      0        0        0        0  0.000000  0.000000  0.000000
 4: 2019-04-12 20:00:00 2019-04-12 23:30:00     2      4      0        0        0        0  0.000000  0.000000  0.000000
 5: 2019-04-15 10:00:00 2019-04-15 15:00:00     2      5      1        0        0        1  0.000000  0.000000  9.316815
 6: 2019-04-20 10:00:00 2019-04-21 12:00:00     3      6      0        0        0        0  0.000000  0.000000  0.000000
 7: 2019-04-22 13:30:00 2019-04-22 17:30:00     3      7      3        2        0        1  8.586061  0.000000 11.498512
 8: 2019-04-10 15:30:00 2019-04-10 20:00:00     4      8      2        0        2        0  0.000000  8.696356  0.000000
 9: 2019-04-12 21:30:00 2019-04-13 05:00:00     4      9      1        1        0        0  9.343681  0.000000  0.000000
10: 2019-04-15 20:00:00 2019-04-15 12:30:00     5     10      0        0        0        0  0.000000  0.000000  0.000000

Explanation of variant 2

The first part is the same as in variant 1.

In the next step, the aggregates are reshaped from long to wide format.

Then, an update join is performed where consEx is updated in place by the matching rows of wide. The expression

(names(wide)) := mget(paste0("i.", names(wide))) 

is a shortcut for

`:=`(totals = i.totals, Count_T1 = i.Count_T1, ..., Mean_T3 = i.Mean_T3)

where the i. prefix refers to columns of wide and is used to avoid ambiguities.
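
A minimal, self-contained sketch of such an update join (toy data, unrelated to the problem above):

dt  <- data.table(id = 1:3, x = c(0, 0, 0))
upd <- data.table(id = 2:3, x = c(10, 20))
dt[upd, on = "id", x := i.x]   # i.x refers to the x column of upd
dt
#    id  x
# 1:  1  0
# 2:  2 10
# 3:  3 20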

In rows of consEx where there is no match, the appended columns contain NA. Finally, setnafill() is used to replace all NA values with 0.
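
If setnafill() is not available (data.table < 1.12.3), a possible fallback is to replace the NAs column by column with set(), for example:

for (col in names(wide)) {
  set(consEx, i = which(is.na(consEx[[col]])), j = col, value = 0)
}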

Benchmark

At the time of writing, 6 different solutions had been posted: by Fino (the OP), Dave2e, Alexis, Alexandre B. (in Python), plus the two variants in this answer.

The 5 R solutions are compared using press() and mark() from the bench package.

The sample datasets are parameterized to be varied in number of rows. As consEx is modified by some solutions, each benchmark run starts with a fresh copy.

Some minor changes were required to make the code work with bench. Most notably, the shortcut formula ... ~ type in dcast() caused an error when called within bench::mark() and had to be replaced by a version where all variables of the LHS are explicitly named.
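
For example, in the uwe1 variant below, the dcast() call of variant 1 becomes:

result <- dcast(agg[consEx, on = "tripId"], 
                tripId + carId + begin + end + totals ~ type, 
                value.var = c("Count", "Mean"), fill = 0)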

As the OP has pointed out that his version is rather slow, the first benchmark run uses only 1000 rows of consEx and 50, 200, and 1000 rows of alertsEx:

library(magrittr) # attaches the %>% pipe used in the data generation below

bm <- bench::press(
  n_trips = 1E3,
  n_alerts = 1E3/c(20, 5, 1),
  {
    n_cars <- 5
    n_periods <- round(n_trips / n_cars)
    n_types = 3
    types = sprintf("T%02i", 1:n_types)
    consEx0 <- data.table(
      tripId = 1:n_trips, 
      carId = rep(1:n_cars, each = n_periods),
      begin = seq.POSIXt(as.POSIXct("2000-01-01"), length.out = n_periods, by = "14 days") %>% 
        rep(n_cars) %>% 
        `mode<-`("double")
    )[, end := begin + 7*24*60*60]
    set.seed(1)
    idx <- sample(n_trips, n_alerts, TRUE)
    alertsEx <- consEx0[
      idx, 
      .(timestamp = begin + runif(n_alerts, 1, 7*24*60*60 - 1),
        type = sample(types, n_alerts, TRUE), 
        carId,
        additionalInfo1 = rnorm(n_alerts, mean = 10, sd = 4))]
    
    bench::mark(
      fino = {
        consEx <- copy(consEx0)
        findAlerts = function(begin,end,carId_2){
          saida = alertsEx[timestamp >= begin & timestamp <= end & carId == carId_2,]
          totals = nrow(saida)
          saida = split(saida,by="type")
          resultsList = list(
            "totals" = 0,
            "t1Count" = 0,
            "t1Mean" = 0,
            "t2Count" = 0,
            "t2Mean" = 0,
            "t3Count" = 0,
            "t3Mean" = 0)
          resultsList[["totals"]] = totals
          types = names(saida)
          if("T1" %in% types){
            resultsList[["t1Count"]] = nrow(saida[["T1"]])
            resultsList[["t1Mean"]] = mean(saida[["T1"]]$additionalInfo1)
          }
          if("T2" %in% types){
            resultsList[["t2Count"]] = nrow(saida[["T2"]])
            resultsList[["t2Mean"]] = mean(saida[["T2"]]$additionalInfo1)
          }
          if("T3" %in% types){
            resultsList[["t3Count"]] = nrow(saida[["T3"]])
            resultsList[["t3Mean"]] = mean(saida[["T3"]]$additionalInfo1)
          }
          return(resultsList)
        }
        for(i in 1:nrow(consEx)){
          aux = findAlerts(consEx$begin[i],consEx$end[i],consEx$carId[i])
          consEx[i,names(aux) := aux]
        }
        consEx[]
      },
      dave = {
        # library(dplyr)
        # library(tidyr)
        
        consEx <- copy(consEx0)
        #split the consEx down to separate carId
        splitcons <- split(consEx, consEx$carId)
        
        alertsEx$tripId <- NA
        #loop and only search the cons that match the alerts
        alertsEx$tripId <- apply(alertsEx, 1, function(a) {
          #retrieve the list associated with the correct car
          cons <- splitcons[[a[["carId"]]]]
          alerttime <- as.POSIXct(a[["timestamp"]])
          #find the trip which contains the alerttime
          tripId <-
            which((alerttime >= cons$begin) & (alerttime <= cons$end))
          #identify and return the tripId
          cons$tripId[tripId]
        })
        
        #Referenced:
        #https://stackoverflow.com/questions/30592094/r-spreading-multiple-columns-with-tidyr
        #https://stackoverflow.com/questions/35321497/spread-multiple-columns-tidyr
        alertsEx2 <-
          alertsEx %>% 
          dplyr::group_by(carId, tripId, type) %>% 
          dplyr::summarize(mean = mean(additionalInfo1), count = dplyr::n())
        alerttable <-
          tidyr::gather(alertsEx2, variable, val, -(carId:type), na.rm = TRUE) %>%
          tidyr::unite(temp, type, variable) %>%
          tidyr::spread(temp, val, fill = 0)
        consEx %>% 
          dplyr::left_join(alerttable, by = c("tripId", "carId")) 
        
      },
      alexis = {
        consEx <- copy(consEx0)
        joined <- consEx[alertsEx,
                         .(carId, tripId, type, additionalInfo1),
                         on = .(carId, begin <= timestamp, end >= timestamp)]
        aggregated <- joined[, .(typeCount = .N, typeMean = mean(additionalInfo1)), by = .(carId, tripId, type)]
        totals <- aggregated[, .(totals = sum(typeCount)), by = .(carId, tripId)]
        long <- dcast(aggregated, carId + tripId ~ type, value.var = c("typeCount", "typeMean"), sep = "", fill = 0)
        replaceNA <- function(x) { replace(x, is.na(x), 0) }
        consEx[, `:=`(as.character(outer(types, c("Count", "Mean"), paste0)),
                      lapply(long[consEx,
                                  as.character(outer(types, c("typeCount", "typeMean"),
                                                     function(a, b) { paste0(b, a) })),
                                  with = FALSE,
                                  on = .(carId, tripId)],
                             replaceNA))]
        consEx[, totals := lapply(totals[consEx, x.totals, on = .(carId, tripId)], replaceNA)]
        setcolorder(consEx, c("carId", "tripId", "begin", "end"))
        consEx
      },
      uwe1 = {
        consEx <- copy(consEx0)
        agg <- consEx[alertsEx, on = .(carId, begin <= timestamp, end >= timestamp), 
                      .(tripId, type, additionalInfo1)][
                        , .(Count = .N, Mean = mean(additionalInfo1)), 
                        by = .(tripId, type)][
                          , totals := sum(Count), by = tripId][] 
        result <- dcast(agg[consEx, on = "tripId"], 
                        tripId + carId + begin+ end + totals ~ type, 
                        value.var = c("Count", "Mean"), fill = 0)[
                          , c("Count_NA", "Mean_NA") := NULL][
                            is.na(totals), totals := 0][]
        # setcolorder(result, names(consEx))
        result
      },
      uwe2 = {
        consEx <- copy(consEx0)
        agg <- consEx[alertsEx, on = .(carId, begin <= timestamp, end >= timestamp), 
                      .(tripId, type, additionalInfo1)][
                        , .(Count = .N, Mean = mean(additionalInfo1)), 
                        by = .(tripId, type)][
                          , totals := sum(Count), by = tripId][] 
        wide <- dcast(agg, tripId + totals ~ type, value.var = c("Count", "Mean"), fill = 0)
        consEx[wide, on = "tripId", (names(wide)) := mget(paste0("i.", names(wide)))]
        setnafill(consEx, fill = 0) # data.table version 1.12.3+ 
        consEx
      },
      check = FALSE
    )
  }
)

library(ggplot2)
autoplot(bm)

[Chart: autoplot(bm) of the first benchmark run (logarithmic time scale)]

Please note the logarithmic time scale.

The chart confirms the OP's assumption that his approach is orders of magnitude slower than any of the other solutions. It took 3 seconds for 1000 rows, while the OP's production dataset has 13M rows. The run time of Dave2e's approach increases with the number of alerts.

Also, the OP's approach is the most demanding in terms of allocated memory. It allocates up to 190 MBytes, while the data.table versions require only about 1 MByte.

bm
# A tibble: 15 x 15
   expression n_trips n_alerts      min   median `itr/sec` mem_alloc `gc/sec` n_itr  n_gc total_time result memory time 
   <bch:expr>   <dbl>    <dbl> <bch:tm> <bch:tm>     <dbl> <bch:byt>    <dbl> <int> <dbl>   <bch:tm> <list> <list> <lis>
 1 fino          1000       50    2.95s    2.95s     0.339  149.25MB     3.39     1    10      2.95s <data~ <Rpro~ <bch~
 2 dave          1000       50  21.52ms  22.61ms    41.6    779.46KB     5.94    21     3   505.01ms <data~ <Rpro~ <bch~
 3 alexis        1000       50  21.98ms  23.09ms    41.2   1007.84KB     3.93    21     2    509.5ms <data~ <Rpro~ <bch~
 4 uwe1          1000       50  14.47ms  15.45ms    62.3    919.12KB     1.95    32     1   513.94ms <data~ <Rpro~ <bch~
 5 uwe2          1000       50  14.14ms  14.73ms    64.2     633.2KB     1.95    33     1   513.91ms <data~ <Rpro~ <bch~
 6 fino          1000      200    3.09s    3.09s     0.323  155.12MB     4.53     1    14      3.09s <data~ <Rpro~ <bch~
 7 dave          1000      200   53.4ms  62.11ms    11.0       1.7MB     5.49     8     4   728.02ms <data~ <Rpro~ <bch~
 8 alexis        1000      200  22.42ms  23.34ms    41.5      1.03MB     3.77    22     2   530.25ms <data~ <Rpro~ <bch~
 9 uwe1          1000      200  14.72ms  15.76ms    62.2    953.07KB     0       32     0   514.46ms <data~ <Rpro~ <bch~
10 uwe2          1000      200  14.49ms  15.17ms    63.6    695.63KB     1.99    32     1    503.4ms <data~ <Rpro~ <bch~
11 fino          1000     1000     3.6s     3.6s     0.278  187.32MB     3.61     1    13       3.6s <data~ <Rpro~ <bch~
12 dave          1000     1000 242.46ms  243.7ms     4.07      6.5MB     6.78     3     5   737.06ms <data~ <Rpro~ <bch~
13 alexis        1000     1000  24.32ms  25.78ms    37.6      1.21MB     3.95    19     2   505.84ms <data~ <Rpro~ <bch~
14 uwe1          1000     1000  16.04ms   16.8ms    56.8      1.05MB     1.96    29     1   510.23ms <data~ <Rpro~ <bch~
15 uwe2          1000     1000  15.69ms  16.41ms    54.8    938.74KB     3.92    28     2   510.63ms <data~ <Rpro~ <bch~

So, Fino's approach is omitted from the second benchmark run, which uses larger problem sizes of 10k and 100k rows for consEx and 2k and 20k rows for alertsEx, respectively.

[Chart: results of the second benchmark run (logarithmic time scale)]

As Dave2e's approach is about 100 times slower than the fastest approach for 20k rows, it is omitted from the third run. This run simulates the OP's production dataset of 13M rows in consEx and 2M rows in alertsEx:

[Chart: results of the third benchmark run (logarithmic time scale)]

print(bm, n = Inf)
# A tibble: 27 x 15
   expression n_trips n_alerts      min   median `itr/sec` mem_alloc `gc/sec` n_itr  n_gc total_time result memory time 
   <bch:expr>   <dbl>    <dbl> <bch:tm> <bch:tm>     <dbl> <bch:byt>    <dbl> <int> <dbl>   <bch:tm> <list> <list> <lis>
 1 alexis      1.00e5    20000 747.87ms 747.87ms  1.34       38.18MB  6.69        1     5   747.87ms <data~ <Rpro~ <bch~
 2 uwe1        1.00e5    20000 124.52ms 138.89ms  7.32       37.88MB  3.66        4     2   546.17ms <data~ <Rpro~ <bch~
 3 uwe2        1.00e5    20000  72.29ms  73.76ms 11.3        16.63MB  1.88        6     1   533.26ms <data~ <Rpro~ <bch~
 4 alexis      1.00e6    20000    7.51s    7.51s  0.133     335.44MB  3.33        1    25      7.51s <data~ <Rpro~ <bch~
 5 uwe1        1.00e6    20000 995.31ms 995.31ms  1.00      346.94MB  2.01        1     2   995.31ms <data~ <Rpro~ <bch~
 6 uwe2        1.00e6    20000 714.19ms 714.19ms  1.40       89.27MB  1.40        1     1   714.19ms <data~ <Rpro~ <bch~
 7 alexis      1.00e7    20000    1.67m    1.67m  0.00999     3.21GB  0.340       1    34      1.67m <data~ <Rpro~ <bch~
 8 uwe1        1.00e7    20000    3.44m    3.44m  0.00485     3.41GB  0.00969     1     2      3.44m <data~ <Rpro~ <bch~
 9 uwe2        1.00e7    20000   42.51s   42.51s  0.0235     810.3MB  0           1     0     42.51s <data~ <Rpro~ <bch~
10 alexis      1.00e5   200000   15.38s   15.38s  0.0650     73.22MB  0.0650      1     1     15.38s <data~ <Rpro~ <bch~
11 uwe1        1.00e5   200000    1.34s    1.34s  0.747      63.81MB  0           1     0      1.34s <data~ <Rpro~ <bch~
12 uwe2        1.00e5   200000    1.47s    1.47s  0.681      58.98MB  0           1     0      1.47s <data~ <Rpro~ <bch~
13 alexis      1.00e6   200000    2.91m    2.91m  0.00573   375.93MB  0.0115      1     2      2.91m <data~ <Rpro~ <bch~
14 uwe1        1.00e6   200000    9.72s    9.72s  0.103     371.69MB  0           1     0      9.72s <data~ <Rpro~ <bch~
15 uwe2        1.00e6   200000 888.67ms 888.67ms  1.13      161.82MB  0           1     0   888.67ms <data~ <Rpro~ <bch~
16 alexis      1.00e7   200000    6.29m    6.29m  0.00265     3.15GB  0.0928      1    35      6.29m <data~ <Rpro~ <bch~
17 uwe1        1.00e7   200000    2.45m    2.45m  0.00681     3.43GB  0.0136      1     2      2.45m <data~ <Rpro~ <bch~
18 uwe2        1.00e7   200000   12.48m   12.48m  0.00134   887.99MB  0.00134     1     1     12.48m <data~ <Rpro~ <bch~
19 alexis      1.00e5  2000000    3.04s    3.04s  0.329        207MB  0           1     0      3.04s <data~ <Rpro~ <bch~
20 uwe1        1.00e5  2000000    2.96s    2.96s  0.338     196.42MB  0           1     0      2.96s <data~ <Rpro~ <bch~
21 uwe2        1.00e5  2000000    2.81s    2.81s  0.355     187.79MB  0           1     0      2.81s <data~ <Rpro~ <bch~
22 alexis      1.00e6  2000000    6.96m    6.96m  0.00239   726.14MB  0.00479     1     2      6.96m <data~ <Rpro~ <bch~
23 uwe1        1.00e6  2000000    2.01m    2.01m  0.00827    631.1MB  0           1     0      2.01m <data~ <Rpro~ <bch~
24 uwe2        1.00e6  2000000   30.54s   30.54s  0.0327    584.81MB  0           1     0     30.54s <data~ <Rpro~ <bch~
25 alexis      1.00e7  2000000   31.54m   31.54m  0.000528    3.66GB  0.0127      1    24     31.54m <data~ <Rpro~ <bch~
26 uwe1        1.00e7  2000000    8.72m    8.72m  0.00191     3.67GB  0           1     0      8.72m <data~ <Rpro~ <bch~
27 uwe2        1.00e7  2000000   12.35m   12.35m  0.00135     1.58GB  0           1     0     12.35m <data~ <Rpro~ <bch~

Caveat: Please note that the run times for most of the parameter variations are only confirmed by a single measurement. In addition, garbage collection (gc) may have an influence due to the limited amount of RAM on my computer (8 GBytes). So, your mileage may vary.

The timings are not consistent. In general, uwe2 (in-place update) is substantially faster than uwe1 (new result set), which confirms jangorecki's assumption. alexis is almost always the slowest. (Please note the logarithmic time scale in the chart, or compare the timings in the table.)

However, some parameter variations deviate from the above pattern. For 10M trips, uwe1 is faster than uwe2 in two cases, and alexis also catches up. These effects might be due to garbage collection.

Memory allocation is much lower for the in-place update approach (even if this does not always pay off in terms of speed).

Again, interpret the results with care. The third benchmark run should be repeated on a larger machine and with more patience to allow for repeated measurements.

Uwe
  • Variant 2 should be preferred as OP mentioned big datasets so copies should be avoided. TLDR would be useful in your answer, it is now difficult to reach the best solution among all the posted ones, and your variant 2 I believe will perform best. – jangorecki May 25 '19 at 11:44
  • Thanks for the amazing answer. I was planning on doing the benchmark myself but didn't find the time. I went with your second variant with some adjustments and it ran for about 5 minutes. Great improvement from the 9 days it would take my original approach to finish. – Fino May 27 '19 at 18:43
  • Uwe, very nice and thorough answer. I knew my answer was an improvement and also far from optimal, I am glad you were able to provide it. – Dave2e May 28 '19 at 01:50
  • Very nice! If the brief tests I did locally don't deceive me, the main speed-up comes from `setnafill`, which is not surprising, since the R version that I used has to copy data a couple of times. – Alexis Jun 15 '19 at 20:08

Here's my attempt. Let me know if you think it might be right, or if it's fast enough.

EDIT: now that table.express v0.2.0 is out, you could express the calculation as follows:

types <- alertsEx[, unique(type)]

aggregated <- consEx %>%
    start_expr %>%
    right_join(alertsEx, carId, begin <= timestamp, end >= timestamp) %>%
    select(carId, tripId, type, additionalInfo1) %>%
    chain %>%
    group_by(carId, tripId, type) %>%
    transmute(typeCount = .N, typeMean = mean(additionalInfo1)) %>%
    group_by(carId, tripId) %>%
    mutate(totals = sum(typeCount)) %>%
    end_expr

wide <- data.table::dcast(
    aggregated,
    ... ~ type,
    value.var = c("typeCount", "typeMean"),
    sep = "",
    fill = 0
)

# as.character removes dimensions
sd_cols <- as.character(outer(types, c("typeCount", "typeMean"), function(a, b) { paste0(b, a) }))
names(sd_cols) <- as.character(outer(types, c("Count", "Mean"), paste0))
sd_cols <- c(sd_cols, totals = "totals")

consEx %>%
    start_expr %>%
    mutate_join(wide, carId, tripId, .SDcols = sd_cols) %>%
    end_expr %>%
    setnafill(fill = 0) %>%
    setcolorder(c("carId", "tripId", "begin", "end"))

Original answer:

library(data.table)

set.seed(322902L)

consEx = data.table(
  begin = as.POSIXct(c("2019-04-01 00:00:10"," 2019-04-07 10:00:00","2019-04-10 23:00:00","2019-04-12 20:00:00","2019-04-15 10:00:00",
                       "2019-04-20 10:00:00","2019-04-22 13:30:00","2019-04-10 15:30:00","2019-04-12 21:30:00","2019-04-15 20:00:00")),

  end = as.POSIXct(c("2019-04-01 20:00:00","2019-04-07 15:00:00","2019-04-11 10:00:00", "2019-04-12 23:30:00","2019-04-15 15:00:00",
                     "2019-04-21 12:00:00","2019-04-22 17:30:00","2019-04-10 20:00:00","2019-04-13 05:00:00", "2019-04-15 12:30:00")),

  carId = c(1,1,1,2,2,3,3,4,4,5),
  tripId = c(1:10)
)

alertsEx = data.table(
  timestamp = as.POSIXct(c("2019-04-01 10:00:00","2019-04-01 10:30:00","2019-04-01 15:00:00","2019-04-15 13:00:00","2019-04-22 14:00:00",
                           "2019-04-22 15:10:00","2019-04-22 15:40:00","2019-04-10 16:00:00","2019-04-10 17:00:00","2019-04-13 04:00:00")),
  type = c("T1","T2","T1",'T3',"T1","T1","T3","T2","T2","T1"),
  carId = c(1,1,1,2,3,3,3,4,4,4),
  additionalInfo1 = rnorm(10,mean=10,sd=4)
)

types <- unique(alertsEx$type)

joined <- consEx[alertsEx,
                 .(carId, tripId, type, additionalInfo1),
                 on = .(carId, begin <= timestamp, end >= timestamp)]

aggregated <- joined[, .(typeCount = .N, typeMean = mean(additionalInfo1)), by = .(carId, tripId, type)]

totals <- aggregated[, .(totals = sum(typeCount)), by = .(carId, tripId)]

wide <- dcast(aggregated, carId + tripId ~ type, value.var = c("typeCount", "typeMean"), sep = "", fill = 0)

replaceNA <- function(x) { replace(x, is.na(x), 0) }

consEx[, `:=`(as.character(outer(types, c("Count", "Mean"), paste0)),
              lapply(wide[consEx,
                          as.character(outer(types, c("typeCount", "typeMean"),
                                             function(a, b) { paste0(b, a) })),
                          with = FALSE,
                          on = .(carId, tripId)],
                     replaceNA))]

consEx[, totals := lapply(totals[consEx, x.totals, on = .(carId, tripId)], replaceNA)]

setcolorder(consEx, c("carId", "tripId", "begin", "end"))
print(consEx)
    carId tripId               begin                 end T1Count T2Count T3Count    T1Mean    T2Mean    T3Mean totals
 1:     1      1 2019-04-01 00:00:10 2019-04-01 20:00:00       2       1       0 10.458406  9.170923  0.000000      3
 2:     1      2 2019-04-07 10:00:00 2019-04-07 15:00:00       0       0       0  0.000000  0.000000  0.000000      0
 3:     1      3 2019-04-10 23:00:00 2019-04-11 10:00:00       0       0       0  0.000000  0.000000  0.000000      0
 4:     2      4 2019-04-12 20:00:00 2019-04-12 23:30:00       0       0       0  0.000000  0.000000  0.000000      0
 5:     2      5 2019-04-15 10:00:00 2019-04-15 15:00:00       0       0       1  0.000000  0.000000 12.694035      1
 6:     3      6 2019-04-20 10:00:00 2019-04-21 12:00:00       0       0       0  0.000000  0.000000  0.000000      0
 7:     3      7 2019-04-22 13:30:00 2019-04-22 17:30:00       2       0       1 10.998666  0.000000  6.812388      3
 8:     4      8 2019-04-10 15:30:00 2019-04-10 20:00:00       0       2       0  0.000000 11.600738  0.000000      2
 9:     4      9 2019-04-12 21:30:00 2019-04-13 05:00:00       1       0       0  6.369317  0.000000  0.000000      1
10:     5     10 2019-04-15 20:00:00 2019-04-15 12:30:00       0       0       0  0.000000  0.000000  0.000000      0
Alexis

One of the reasons for the slowness is looping through the larger data frame. If you reverse the search and step through the alerts table instead, that should improve performance somewhat.

I attempted a divide-and-conquer technique of splitting the consEx data frame into a list by carId in order to speed up the search. Once the trip ID is assigned to the alertsEx data frame, it is then a matter of summarizing the table and putting it into the correct format.

Here is a solution using data.frames and the dplyr and tidyr packages.

library(dplyr)
library(tidyr)

#split the consEx down to separate carId
splitcons<-split(consEx, consEx$carId)

alertsEx$tripid<-NA
#loop and only search the cons that match the alerts
alertsEx$tripid<-apply(alertsEx, 1, function(a){
  #retrieve the list associated with the correct car
  cons<-splitcons[[a[["carId"]] ]]
  alerttime<-as.POSIXct(a[["timestamp"]])
  #find the trip which contains the alerttime
  tripid<- which((alerttime>= cons$begin) & (alerttime<=cons$end))
  #identify and return the tripid
  cons$tripId[tripid]
})

#Referenced:
#https://stackoverflow.com/questions/30592094/r-spreading-multiple-columns-with-tidyr
#https://stackoverflow.com/questions/35321497/spread-multiple-columns-tidyr
alertsEx2<-alertsEx %>% group_by(carId, tripid, type) %>% summarize( mean=mean(additionalInfo1), count=n())
alerttable<-gather(alertsEx2, variable, val, -(carId:type), na.rm=TRUE) %>% 
  unite(temp, type, variable) %>%
  spread(temp, val, fill=0)


# A tibble: 5 x 8
# Groups:   carId, trip [5]
 carId tripid T1_count T1_mean T2_count T2_mean T3_count T3_mean
  <dbl> <int>    <dbl>   <dbl>    <dbl>   <dbl>    <dbl>   <dbl>
1     1     1        2    9.50        1   14.6         0     0  
2     2     5        0    0           0    0           1    11.2
3     3     7        2    6.86        0    0           1    17.7
4     4     8        0    0           2    7.95        0     0  
5     4     9        1   11.4         0    0           0     0  

#last step is to left_join "alerttable" to "consEx" using the carId and tripid columns
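
A sketch of that last step (objects as created above; note that alerttable uses the lower-case column name tripid while consEx uses tripId):

result <- consEx %>%
  left_join(alerttable, by = c("carId", "tripId" = "tripid")) %>%
  mutate_if(is.numeric, function(x) replace(x, is.na(x), 0))  # trips without alerts get zeros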

I'm sure there is still a more efficient dplyr/purrr or data.table method for the trip ID search.

Dave2e

I would suggest one way to do this in Python using the pandas package. It is close to the R approach suggested by @Dave2e. I use an apply function over the rows of consEx to find the matching rows of the alertsEx dataframe. Then, those matching rows are grouped by type to compute the mean and count.

Here are the details:

# Import module
import pandas as pd
import numpy as np

# Define dataframe
consEx = pd.DataFrame({
        "begin":["2019-04-01 00:00:10"," 2019-04-07 10:00:00","2019-04-10 23:00:00","2019-04-12 20:00:00","2019-04-15 10:00:00",
                "2019-04-20 10:00:00","2019-04-22 13:30:00","2019-04-10 15:30:00","2019-04-12 21:30:00","2019-04-15 20:00:00"],

        "end": ["2019-04-01 20:00:00","2019-04-07 15:00:00","2019-04-11 10:00:00", "2019-04-12 23:30:00","2019-04-15 15:00:00",
            "2019-04-21 12:00:00","2019-04-22 17:30:00","2019-04-10 20:00:00","2019-04-13 05:00:00", "2019-04-15 12:30:00"],

        "carId": [1,1,1,2,2,3,3,4,4,5],
        "tripId": [i for i in range (1,11)]
    })
alertsEx = pd.DataFrame({
        "timestamp": ["2019-04-01 10:00:00","2019-04-01 10:30:00","2019-04-01 15:00:00","2019-04-15 13:00:00","2019-04-22 14:00:00",
                        "2019-04-22 15:10:00","2019-04-22 15:40:00","2019-04-10 16:00:00","2019-04-10 17:00:00","2019-04-13 04:00:00"],
        "type": ["T1","T2","T1",'T3',"T1","T1","T3","T2","T2","T1"],
        "carId": [1,1,1,2,3,3,3,4,4,4],
        "additionalInfo1": np.random.normal(10, 4, 10)
    })

# Convert the date columns to datetime objects so the comparisons below work correctly
consEx["begin"] = pd.to_datetime(consEx["begin"])
consEx["end"] = pd.to_datetime(consEx["end"])
alertsEx.timestamp = pd.to_datetime(alertsEx.timestamp)

# Get all the type possible
list_of_type = np.unique(alertsEx.type)
print(list_of_type)
# ['T1' 'T2' 'T3']

# Function applied to each row of consEx
def getExtraColums(row):
    # Select the alerts of the same car whose timestamp falls within the trip period
    matching_rows = alertsEx[(alertsEx.carId == row.carId)
                             & (alertsEx.timestamp >= row.begin)
                             & (alertsEx.timestamp <= row.end)]
    # Define total as the number of rows
    row["total"] = matching_rows.shape[0]

    # Iterate over the matching rows group by type
    for name, sub_df in matching_rows.groupby(by="type"):
        # Define count value
        row[name+"_count"] = sub_df.shape[0]
        # Define mean value
        row[name+"_mean"] = sub_df.additionalInfo1.mean()
    return row

# Apply the function : axis=1 for rows
consEx = consEx.apply(getExtraColums, axis=1)

# Reorder the columns according to the expected output
new_order = [6, 8, 7, 10, 9, 0,1,2,3,4,5 ]
consEx = consEx[consEx.columns[new_order]]
print(consEx)
#                 begin                 end  carId  tripId  total  T1_count    T1_mean  T2_count    T2_mean  T3_count   T3_mean
# 0 2019-04-01 00:00:10 2019-04-01 20:00:00      1       1      3       2.0   7.028186       1.0   9.895152       NaN       NaN
# 1 2019-04-07 10:00:00 2019-04-07 15:00:00      1       2      0       NaN        NaN       NaN        NaN       NaN       NaN
# 2 2019-04-10 23:00:00 2019-04-11 10:00:00      1       3      0       NaN        NaN       NaN        NaN       NaN       NaN
# 3 2019-04-12 20:00:00 2019-04-12 23:30:00      2       4      0       NaN        NaN       NaN        NaN       NaN       NaN
# 4 2019-04-15 10:00:00 2019-04-15 15:00:00      2       5      1       NaN        NaN       NaN        NaN       1.0  8.629333
# 5 2019-04-20 10:00:00 2019-04-21 12:00:00      3       6      0       NaN        NaN       NaN        NaN       NaN       NaN
# 6 2019-04-22 13:30:00 2019-04-22 17:30:00      3       7      3       2.0  10.084251       NaN        NaN       1.0  3.845562
# 7 2019-04-10 15:30:00 2019-04-10 20:00:00      4       8      2       NaN        NaN       2.0  12.523441       NaN       NaN
# 8 2019-04-12 21:30:00 2019-04-13 05:00:00      4       9      1       1.0   7.363178       NaN        NaN       NaN       NaN
# 9 2019-04-15 20:00:00 2019-04-15 12:30:00      5      10      0       NaN        NaN       NaN        NaN       NaN       NaN

consEx = consEx.fillna(0)
print(consEx)
#                 begin                 end  carId  tripId  total  T1_count    T1_mean  T2_count    T2_mean  T3_count   T3_mean
# 0 2019-04-01 00:00:10 2019-04-01 20:00:00      1       1      3       2.0   7.028186       1.0   9.895152       0.0  0.000000
# 1 2019-04-07 10:00:00 2019-04-07 15:00:00      1       2      0       0.0   0.000000       0.0   0.000000       0.0  0.000000
# 2 2019-04-10 23:00:00 2019-04-11 10:00:00      1       3      0       0.0   0.000000       0.0   0.000000       0.0  0.000000
# 3 2019-04-12 20:00:00 2019-04-12 23:30:00      2       4      0       0.0   0.000000       0.0   0.000000       0.0  0.000000
# 4 2019-04-15 10:00:00 2019-04-15 15:00:00      2       5      1       0.0   0.000000       0.0   0.000000       1.0  8.629333
# 5 2019-04-20 10:00:00 2019-04-21 12:00:00      3       6      0       0.0   0.000000       0.0   0.000000       0.0  0.000000
# 6 2019-04-22 13:30:00 2019-04-22 17:30:00      3       7      3       2.0  10.084251       0.0   0.000000       1.0  3.845562
# 7 2019-04-10 15:30:00 2019-04-10 20:00:00      4       8      2       0.0   0.000000       2.0  12.523441       0.0  0.000000
# 8 2019-04-12 21:30:00 2019-04-13 05:00:00      4       9      1       1.0   7.363178       0.0   0.000000       0.0  0.000000
# 9 2019-04-15 20:00:00 2019-04-15 12:30:00      5      10      0       0.0   0.000000       0.0   0.000000       0.0  0.000000

This may give you some ideas.

I don't know whether it's more efficient than the R solutions; let us know!

Alexandre B.
  • Perhaps you can benchmark that since you are suggesting it. `.apply` in Python Pandas is known to be very slow I thought. Here are some benchmarks I've done and I've found pandas to be slower and more memory demanding which causes it to fail earlier than data.table in R. https://h2oai.github.io/db-benchmark/ – Matt Dowle May 27 '19 at 16:53

One approach is to use a left join and mark the entries where the timestamp falls between begin and end. If the join size becomes an issue, the joins can be done individually for each carId.

library(dplyr)
library(tidyr)

df = consEx %>% 
  left_join(alertsEx, by="carId") %>% 
  mutate(AlertEntry = ifelse(timestamp>= begin & timestamp<=end, "Yes", "No")) %>% 
  filter(AlertEntry == "Yes") %>% 
  group_by(carId, tripId, type) %>% 
  summarise(Total = n(),
            Mean = mean(additionalInfo1))

# Create Summary
df_summary = df %>% 
  group_by(carId, tripId) %>% 
  gather(key, value, -c(carId, tripId, type), na.rm=T) %>% 
  unite(newType, type, key) %>% 
  spread(newType, value, fill=0)
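
df_summary only contains trips that have at least one alert. A sketch of joining it back to consEx so that trips without alerts show up with zeros (column names as created above):

result = consEx %>% 
  left_join(df_summary, by = c("carId", "tripId")) %>% 
  mutate_if(is.numeric, function(x) replace(x, is.na(x), 0))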
Theo