
I am looking to collapse rows of data by group, based on a specified time difference (e.g. 60 mins) between timestamps and/or until a particular condition is met within the data. Here is a mock data frame of what I am working with:

UserId<-c("2203af12ce3e", "2203af12ce3e", "2203af12ce3e", "2203af12ce3e", 
          "3b9c32d4c700", "3b9c32d4c700", "3b9c32d4c700", "3b9c32d4c700", 
          "3b9c32d4c700", "68b25fd3ca78", "68b25fd3ca78", "68b25fd3ca78", 
          "68b25fd3ca78", "68b25fd3ca78", "fbbd0e13e61b", "fbbd0e13e61b", 
          "fbbd0e13e61b", "808dcbe0cad2", "6f2020841f9e", "faf849c35400", 
          "02002044e512", "02002044e512", "02002044e512", "02002044e512", 
          "ff10b8560791", "ff10b8560791", "ff10b8560791", "ff10b8560791", 
          "ff10b8560791", "ff10b8560791", "ff10b8560791", "ff10b8560791", 
          "ff10b8560791", "ff10b8560791", "ff10b8560791", "ff10b8560791", 
          "ff10b8560791", "ff10b8560791", "ff10b8560791", "ff10b8560791")


OrigTime<-structure(c(1546313039, 1546313070, 1546313116, 1546344887, 1546366783, 
                      1546371206, 1546378029, 1546380713, 1546381727, 1546317095, 1546317335, 
                      1546319551, 1546347453, 1546355351, 1546381815, 1546381844, 1546381873, 
                      1546355462, 1546370527, 1546354015, 1546310854, 1546311154, 1546311218, 
                      1546311772, 1546308212, 1546308230, 1546308248, 1546308268, 1546308298, 
                      1546308317, 1546330603, 1546330620, 1546330637, 1546330650, 1546330676, 
                      1546334052, 1546334129, 1546334523, 1546334867, 1546334902), class = c("POSIXct", 
                                                                                             "POSIXt"), tzone = "UTC")

LastTime<-structure(c(1546313039, 1546313070, 1546313448, 1546344887, 1546366783, 
                      1546371206, 1546378029, 1546380713, 1546381727, 1546317095, 1546318123, 
                      1546319551, 1546347453, 1546355351, 1546381815, 1546381844, 1546381873, 
                      1546355462, 1546370527, 1546354015, 1546311063, 1546311154, 1546311746, 
                      1546313128, 1546308212, 1546308230, 1546308248, 1546308268, 1546308298, 
                      1546310346, 1546330603, 1546330620, 1546330637, 1546330650, 1546330676, 
                      1546334052, 1546334129, 1546334523, 1546334867, 1546334902), class = c("POSIXct", 
                                                                                             "POSIXt"), tzone = "UTC")

calls<-c(1, 1, 3, 1, 1, 1, 1, 1, 1, 1, 3, 1, 1, 1, 1, 1, 1, 1, 1, 1, 
         4, 1, 3, 4, 1, 1, 1, 1, 1, 3, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1)

Status<-c("Engaged", "Engaged", "Abandoned", "Abandoned", "Answered", 
          "Answered", "Answered", "Answered", "Answered", "Engaged", "Engaged", 
          "Engaged", "Answered", "Answered", "Engaged", "Engaged", "Answered", 
          "Answered", "Answered", "Answered", "Engaged", "Engaged", "Engaged", 
          "Abandoned", "Engaged", "Engaged", "Engaged", "Engaged", "Engaged", 
          "Answered", "Engaged", "Engaged", "Engaged", "Engaged", "Answered", 
          "Answered", "Answered", "Answered", "Answered", "Answered")

Successful<-c("N", "N", "N", "N", "Y", "Y", "Y", "Y", "Y", "N", "N", "N", 
              "Y", "Y", "N", "N", "Y", "Y", "Y", "Y", "N", "N", "N", "N", "N", 
              "N", "N", "N", "N", "Y", "N", "N", "N", "N", "Y", "Y", "Y", "Y", 
              "Y", "Y")


df<-as.data.frame(cbind(UserId,OrigTime,LastTime,calls,Status,Successful))
df$OrigTime<-as.numeric(df$OrigTime)
df$OrigTime<-as.POSIXct(df$OrigTime, origin = "1970-01-01 00:00:00")
df$LastTime<-as.numeric(df$LastTime)
df$LastTime<-as.POSIXct(df$LastTime, origin = "1970-01-01 00:00:00")


Here are a few examples of what I am trying to achieve:

Example 1

Here is UserId = "2203af12ce3e", as it is in df:-

Before

df%>%filter(UserId=="2203af12ce3e")
             OrigTime            LastTime calls    Status Successful       UserId
1 2019-01-01 03:23:59 2019-01-01 03:23:59     1   Engaged          N 2203af12ce3e
2 2019-01-01 03:24:30 2019-01-01 03:24:30     1   Engaged          N 2203af12ce3e
3 2019-01-01 03:25:16 2019-01-01 03:30:48     3 Abandoned          N 2203af12ce3e
4 2019-01-01 12:14:47 2019-01-01 12:14:47     1 Abandoned          N 2203af12ce3e

I want to collapse the rows where the gap between one row's LastTime and the next row's OrigTime is within 60 mins, while tallying up the calls and showing the last Status. So in this case, I would collapse rows 1-3 into a single row, with the calls tallied up and the last status as "Abandoned", which would look like this:

Outcome

             OrigTime            LastTime calls    Status Successful       UserId
1 2019-01-01 03:23:59 2019-01-01 03:30:48     5 Abandoned          N 2203af12ce3e   
2 2019-01-01 12:14:47 2019-01-01 12:14:47     1 Abandoned          N 2203af12ce3e

Notice that the OrigTime of row 1 and the LastTime of row 3 in the Before segment are now in one row, the calls from rows 1-3 are tallied up, and the Status is taken from row 3, because rows 1-3 all occur within a 60-minute window. Row 4 is left as it is because its OrigTime is more than 60 mins after the previous LastTime.

I want to give one more example just to be clear on what I am trying to do.

Example 2

If several "Abandoned" and "Engaged" rows come before an "Answered" row, and all of them occur within a 60-minute window, then I want these rows to be collapsed into one with the final status of "Answered". Here is example 2:

Before

df%>%filter(UserId=="ff10b8560791")
              OrigTime            LastTime calls   Status Successful       UserId
1  2019-01-01 02:03:32 2019-01-01 02:03:32     1  Engaged          N ff10b8560791
2  2019-01-01 02:03:50 2019-01-01 02:03:50     1  Engaged          N ff10b8560791
3  2019-01-01 02:04:08 2019-01-01 02:04:08     1  Engaged          N ff10b8560791
4  2019-01-01 02:04:28 2019-01-01 02:04:28     1  Engaged          N ff10b8560791
5  2019-01-01 02:04:58 2019-01-01 02:04:58     1  Engaged          N ff10b8560791
6  2019-01-01 02:05:17 2019-01-01 02:39:06     3 Answered          Y ff10b8560791
7  2019-01-01 08:16:43 2019-01-01 08:16:43     1  Engaged          N ff10b8560791
8  2019-01-01 08:17:00 2019-01-01 08:17:00     1  Engaged          N ff10b8560791
9  2019-01-01 08:17:17 2019-01-01 08:17:17     1  Engaged          N ff10b8560791
10 2019-01-01 08:17:30 2019-01-01 08:17:30     1  Engaged          N ff10b8560791
11 2019-01-01 08:17:56 2019-01-01 08:17:56     1 Answered          Y ff10b8560791
12 2019-01-01 09:14:12 2019-01-01 09:14:12     1 Answered          Y ff10b8560791
13 2019-01-01 09:15:29 2019-01-01 09:15:29     1 Answered          Y ff10b8560791
14 2019-01-01 09:22:03 2019-01-01 09:22:03     1 Answered          Y ff10b8560791
15 2019-01-01 09:27:47 2019-01-01 09:27:47     1 Answered          Y ff10b8560791
16 2019-01-01 09:28:22 2019-01-01 09:28:22     1 Answered          Y ff10b8560791

As in the previous example, I want to collapse rows where the OrigTime is within 60 mins of the previous LastTime, tally up the calls column and keep the last status of the group. However, here is the extra condition: if a row has an "Answered" status and the preceding rows all occur within 60 mins of that "Answered" row, then I want to stop there, tally up those preceding rows and have the final status as "Answered". Here is the outcome for this example:

Outcome

              OrigTime            LastTime calls   Status   Successful       UserId
1  2019-01-01 02:03:32 2019-01-01 02:39:06   8    Answered          Y   ff10b8560791
2  2019-01-01 08:16:43 2019-01-01 08:17:56   5    Answered          Y   ff10b8560791
3  2019-01-01 09:14:12 2019-01-01 09:14:12   1    Answered          Y   ff10b8560791
4  2019-01-01 09:15:29 2019-01-01 09:15:29   1    Answered          Y   ff10b8560791
5  2019-01-01 09:22:03 2019-01-01 09:22:03   1    Answered          Y   ff10b8560791
6  2019-01-01 09:27:47 2019-01-01 09:27:47   1    Answered          Y   ff10b8560791
7  2019-01-01 09:28:22 2019-01-01 09:28:22   1    Answered          Y   ff10b8560791

So in this case, rows 1-6 in the Before segment have been collapsed because they all occur within 60 minutes, stopping at the row where the status was "Answered". The same happens for rows 7-11. The remaining rows each have the "Answered" status, so they stay as they are. I hope the Outcome segments that I have provided make it clear what I am trying to do. Any help will be greatly appreciated :)

1 Answer


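We create a new group for each combination of:

  1. UserId
  2. A running counter that increases after every row with Status == 'Answered'
  3. 60-minute interval, counted in whole hours from the user's first OrigTime

For each group created we select:

  1. First OrigTime
  2. Last LastTime
  3. Sum of calls
  4. Last Status
  5. Last Successful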
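library(dplyr)
df %>%
  # grp1: number of 'Answered' rows seen before the current row,
  # so a new group starts right after each 'Answered' row
  mutate(grp1 = lag(cumsum(Status == 'Answered'), default = 0)) %>%
  group_by(UserId) %>%
  # grp2: whole hours elapsed since the user's first OrigTime
  mutate(grp2 = floor(as.numeric(difftime(LastTime, 
                      first(OrigTime), units = 'hours')))) %>%
  group_by(UserId, grp1, grp2) %>%
  # collapse each group into a single row
  summarise(OrigTime = first(OrigTime), 
            LastTime = last(LastTime), 
            calls = sum(calls), 
            Status = last(Status), 
            Successful = last(Successful)) %>%
  ungroup -> result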
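
To make the two grouping keys easier to follow, here is a small sketch on toy vectors (the vectors `status` and `times` are made up for illustration and are not part of `df`). It shows how `lag(cumsum(...))` starts a new group after each 'Answered' row, and how `floor()` on elapsed hours assigns each timestamp to an hourly bin counted from the first timestamp.

```r
library(dplyr)

# Toy status vector (illustrative only, not taken from df)
status <- c("Engaged", "Engaged", "Answered", "Engaged", "Answered", "Answered")

# Running count of 'Answered' rows, shifted down by one row so that an
# 'Answered' row closes its group instead of opening a new one
answered_so_far <- cumsum(status == "Answered")      # 0 0 1 1 2 3
grp1 <- dplyr::lag(answered_so_far, default = 0)     # 0 0 0 1 1 2

# Toy timestamps, in seconds since the epoch (illustrative only)
times <- as.POSIXct(c(0, 600, 3500, 3700, 7300),
                    origin = "1970-01-01", tz = "UTC")

# Whole hours elapsed since the first timestamp
grp2 <- floor(as.numeric(difftime(times, times[1], units = "hours")))
grp2                                                 # 0 0 0 1 2
```

Note that the third and fourth toy timestamps are only 200 seconds apart but land in different hourly bins, because the bins are fixed relative to the first timestamp.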

Check the result:

result %>% filter(UserId == '2203af12ce3e')
# UserId        grp1  grp2 OrigTime            LastTime            calls Status    Successful
#  <chr>        <dbl> <dbl> <dttm>              <dttm>              <dbl> <chr>     <chr>     
#1 2203af12ce3e     0     0 2019-01-01 11:23:59 2019-01-01 11:30:48     5 Abandoned N         
#2 2203af12ce3e     0     8 2019-01-01 20:14:47 2019-01-01 20:14:47     1 Abandoned N

result %>% filter(UserId == 'ff10b8560791')
# A tibble: 7 x 8
#  UserId        grp1  grp2 OrigTime            LastTime            calls Status   Successful
#  <chr>        <dbl> <dbl> <dttm>              <dttm>              <dbl> <chr>    <chr>     
#1 ff10b8560791    11     0 2019-01-01 10:03:32 2019-01-01 10:39:06     8 Answered Y         
#2 ff10b8560791    12     6 2019-01-01 16:16:43 2019-01-01 16:17:56     5 Answered Y         
#3 ff10b8560791    13     7 2019-01-01 17:14:12 2019-01-01 17:14:12     1 Answered Y         
#4 ff10b8560791    14     7 2019-01-01 17:15:29 2019-01-01 17:15:29     1 Answered Y         
#5 ff10b8560791    15     7 2019-01-01 17:22:03 2019-01-01 17:22:03     1 Answered Y         
#6 ff10b8560791    16     7 2019-01-01 17:27:47 2019-01-01 17:27:47     1 Answered Y         
#7 ff10b8560791    17     7 2019-01-01 17:28:22 2019-01-01 17:28:22     1 Answered Y
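
Because `grp2` counts whole hours from each user's first OrigTime, two rows a few minutes apart can still end up in different groups if they straddle an hour boundary. If the 60-minute window should instead be measured from the previous row's LastTime, as the question describes, one possible variant is sketched below. This is an assumption-laden sketch, not the code that produced the output above: the function name `collapse_calls` and its `gap_mins` argument are hypothetical, and `.groups = "drop"` assumes dplyr 1.0.0 or later. The toy data reuses the four rows of user 2203af12ce3e from the question.

```r
library(dplyr)

# Hypothetical helper: start a new group when the gap from the previous
# LastTime exceeds gap_mins, or when the previous row was 'Answered'
collapse_calls <- function(data, gap_mins = 60) {
  data %>%
    arrange(UserId, OrigTime) %>%
    group_by(UserId) %>%
    mutate(
      gap = as.numeric(difftime(OrigTime, dplyr::lag(LastTime), units = "mins")),
      prev_answered = dplyr::lag(Status == "Answered", default = FALSE),
      new_grp = is.na(gap) | gap > gap_mins | prev_answered,
      grp = cumsum(new_grp)
    ) %>%
    group_by(UserId, grp) %>%
    summarise(OrigTime = first(OrigTime),
              LastTime = last(LastTime),
              calls = sum(calls),
              Status = last(Status),
              Successful = last(Successful),
              .groups = "drop")
}

# The four rows of user 2203af12ce3e from the question
toy <- data.frame(
  UserId = "2203af12ce3e",
  OrigTime = as.POSIXct(c(1546313039, 1546313070, 1546313116, 1546344887),
                        origin = "1970-01-01", tz = "UTC"),
  LastTime = as.POSIXct(c(1546313039, 1546313070, 1546313448, 1546344887),
                        origin = "1970-01-01", tz = "UTC"),
  calls = c(1, 1, 3, 1),
  Status = c("Engaged", "Engaged", "Abandoned", "Abandoned"),
  Successful = "N"
)

out <- collapse_calls(toy)
out$calls   # 5 1
```

On the two users shown in the question, this variant should give the same groups as the hourly-bin approach; the two would differ only for rows that sit close together in time but on opposite sides of an hour boundary.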

The times printed for `result` differ from those in the question because we are in different time zones.
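
If you want the printed times to match regardless of where the code runs, one option is to set the time zone explicitly. The sketch below uses the first OrigTime value from the question; the variable name `x` is just for illustration.

```r
# First OrigTime value from the question, with the time zone set explicitly
x <- as.POSIXct(1546313039, origin = "1970-01-01", tz = "UTC")

format(x, tz = "UTC", usetz = TRUE)
# "2019-01-01 03:23:59 UTC"

# For an existing POSIXct column, the display time zone can be changed
# without altering the underlying instant, e.g.:
# attr(df$OrigTime, "tzone") <- "UTC"
```

The underlying number of seconds is the same in every time zone; only the printed representation changes.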

To construct the data frame, don't use cbind (it converts the data to a matrix, so every column ends up with the same type); use data.frame directly.

df<- data.frame(UserId,OrigTime,LastTime,calls,Status,Successful)
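
As a small illustration of why this matters, the sketch below uses two made-up rows (the values and the names `m` and `df_ok` are only for demonstration). With `cbind`, the character column forces everything into a character matrix and the POSIXct class is lost, whereas `data.frame` keeps each column's own type.

```r
# Two illustrative rows with mixed column types
UserId   <- c("a", "b")
OrigTime <- as.POSIXct(c(1546313039, 1546313070),
                       origin = "1970-01-01", tz = "UTC")
calls    <- c(1, 3)

# cbind builds a matrix, which can hold only one type: everything
# becomes character because UserId is character
m <- cbind(UserId, OrigTime, calls)
is.character(m)                       # TRUE

# data.frame keeps one type per column
df_ok <- data.frame(UserId, OrigTime, calls)
inherits(df_ok$OrigTime, "POSIXct")   # TRUE
is.numeric(df_ok$calls)               # TRUE
```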
Ronak Shah
    Thank you for your reply and solution. After doing a lot of quality checks, I think that your solution works brilliantly and I think it's rather succinct as well; certainly more succinct than anything I could come up with! Once again, thank you for taking the time to fix this for me :) – Robin Turkington Sep 24 '20 at 10:14