0

I have complex problem of using dates in dataset. Let me describe the problem and give example.

Data is hospital data, where we have information when person got final diagnose (key_date) and data of events when person has been in hospital (date_in, date_out).

library(tidyverse)

sample_size = 1000
d <- tibble(id = seq(1, sample_size, 1), 
           key_date = sample(seq(as.Date('2004/01/01'), as.Date('2009/01/01'), by="day"), sample_size, replace = T),
           date_in = sample(seq(as.Date('1999/01/01'), as.Date('2005/01/01'), by="day"), sample_size, replace = T)) %>% 
  mutate(
    date_out = date_in + sample(seq(1,20, 1), sample_size, replace = T)    
  )
d


# A tibble: 1,000 x 4
      id key_date   date_in    date_out  
   <dbl> <date>     <date>     <date>    
 1     1 2004-12-11 1999-08-03 1999-08-16
 2     2 2007-04-07 2002-10-29 2002-11-07
 3     3 2006-11-04 2002-07-10 2002-07-14
 4     4 2008-05-17 2001-08-14 2001-08-21
 5     5 2006-10-23 2004-06-17 2004-07-04
 6     6 2004-12-22 2002-02-27 2002-03-06
 7     7 2007-01-13 2001-05-27 2001-06-14
 8     8 2005-01-11 1999-09-17 1999-09-24
 9     9 2008-04-06 2000-10-05 2000-10-07
10    10 2007-03-15 1999-06-13 1999-06-23

We want to put all patients in same timeline so that in new scale every person has diagnose at point 0. Related to this, we scale also date_in and date_out variables to this timeline.

d <- d %>% 
  mutate(date_in_new = date_in - key_date + 1,
         date_out_new = date_out - key_date) 

d
# A tibble: 1,000 x 6
      id key_date   date_in    date_out   date_in_new date_out_new
   <dbl> <date>     <date>     <date>     <drtn>      <drtn>      
 1     1 2004-12-11 1999-08-03 1999-08-16 -1956 days  -1944 days  
 2     2 2007-04-07 2002-10-29 2002-11-07 -1620 days  -1612 days  
 3     3 2006-11-04 2002-07-10 2002-07-14 -1577 days  -1574 days  
 4     4 2008-05-17 2001-08-14 2001-08-21 -2467 days  -2461 days  
 5     5 2006-10-23 2004-06-17 2004-07-04  -857 days   -841 days  
 6     6 2004-12-22 2002-02-27 2002-03-06 -1028 days  -1022 days  
 7     7 2007-01-13 2001-05-27 2001-06-14 -2056 days  -2039 days  
 8     8 2005-01-11 1999-09-17 1999-09-24 -1942 days  -1936 days  
 9     9 2008-04-06 2000-10-05 2000-10-07 -2739 days  -2738 days  
10    10 2007-03-15 1999-06-13 1999-06-23 -2831 days  -2822 days  

Next we want to calculate how many patients has been in hospital 1 year, 2 year and so before the diagnose. For example data for first 3 rows of original data would look like this:

year; days; patients
-6; 13; 1
-5; 13; 2

Which means that days is counted as how many days in hospital person was in new scaled date variable. It is possible that same "event" is divided to two or more years. Year means here 365 days and this reference table helps to understand what are day limits for each year:

year; day_limits
2; c(730, 1094)
1; c(365, 729)
0; c(0, 364)
-1; c(-1 , -365)
-2; c(-366, -730)
-3; c(-731, -1095)
-4; c(-1096, -1460)
-5; c(-1461, -1825)
-6; c(-1826, -2190)

What is the best and fastest way to do this calculation? I know this can be done by for-loop going thru whole dataset and checking which rows identify to year, but I find it to be slow. Idea is to get data frame with variables

  • year: variable according to timeline. 1,2,3,4, … years before the diagnose also 1,2,3,4,… years after the diagnose
  • patients: how different id’s has been in hospital
  • days: how many days patients has spent in hospital during the year

Whole idea of this data wrangle is to visualize how patients hospitalized days are growing before "real" diagnose. Most likely real data looks like this

enter image description here

jkm
  • 41
  • 6

0 Answers0