0

The challenge is a data.frame with with one group variable (id) and two date variables (start and stop). The date intervals are irregular and I'm trying to calculate the uninterrupted interval in days starting from the first startdate per group.

Example data:

data <- data.frame(
  id = c(1, 2, 2, 3, 3, 3, 3, 3, 4, 5),
  start = as.Date(c("2016-02-18", "2016-12-07", "2016-12-12", "2015-04-10", 
                    "2015-04-12", "2015-04-14", "2015-05-15", "2015-07-14", 
                    "2010-12-08", "2011-03-09")),
  stop = as.Date(c("2016-02-19", "2016-12-12", "2016-12-13", "2015-04-13", 
                   "2015-04-22", "2015-05-13", "2015-07-13", "2015-07-15", 
                   "2010-12-10", "2011-03-11"))
)

> data
   id      start       stop
1   1 2016-02-18 2016-02-19
2   2 2016-12-07 2016-12-12
3   2 2016-12-12 2016-12-13
4   3 2015-04-10 2015-04-13
5   3 2015-04-12 2015-04-22
6   3 2015-04-14 2015-05-13
7   3 2015-05-15 2015-07-13
8   3 2015-07-14 2015-07-15
9   4 2010-12-08 2010-12-10
10  5 2011-03-09 2011-03-11

The aim would a data.frame like this:

   id      start       stop duration_from_start
1   1 2016-02-18 2016-02-19                   2
2   2 2016-12-07 2016-12-12                   7
3   2 2016-12-12 2016-12-13                   7
4   3 2015-04-10 2015-04-13                  34
5   3 2015-04-12 2015-04-22                  34
6   3 2015-04-14 2015-05-13                  34
7   3 2015-05-15 2015-07-13                  34
8   3 2015-07-14 2015-07-15                  34
9   4 2010-12-08 2010-12-10                   3
10  5 2011-03-09 2011-03-11                   3

Or this:

  id      start       stop duration_from_start
1  1 2016-02-18 2016-02-19                   2
2  2 2016-12-07 2016-12-13                   7
3  3 2015-04-10 2015-05-13                  34
4  4 2010-12-08 2010-12-10                   3
5  5 2011-03-09 2011-03-11                   3

It's important to identify the gap from row 6 to 7 and to take this point as the maximum interval (34 days). The interval 2018-10-01to 2018-10-01 would be counted as 1.

My usual lubridate approaches don't work with this example (interval %within lag(interval)).

Any idea?

ceefel
  • 143
  • 1
  • 10

1 Answers1

2
library(magrittr)
library(data.table)
setDT(data)

first_int <- function(start, stop){
  ind <- rleid((start - shift(stop, fill = Inf)) > 0) == 1
  list(start = min(start[ind]),
       stop  = max(stop[ind]))
}

newdata <- 
  data[, first_int(start, stop), by = id] %>% 
     .[, duration := stop - start + 1]


#    id      start       stop duration
# 1:  1 2016-02-18 2016-02-19   2 days
# 2:  2 2016-12-07 2016-12-13   7 days
# 3:  3 2015-04-10 2015-05-13  34 days
# 4:  4 2010-12-08 2010-12-10   3 days
# 5:  5 2011-03-09 2011-03-11   3 days
IceCreamToucan
  • 23,575
  • 2
  • 13
  • 25
  • Clever! Row 5 being a subset of row 4 was really messing up my answer (which I no longer need to post now that you've provided this!). – DanY Oct 02 '18 at 16:09
  • 1
    Just realized the mistake in the example. Should be `2015-04-13`in row 4. Sorry about that! Corrected the mistake above. – ceefel Oct 02 '18 at 16:35