0

I have a data set in R that looks something link this:

ClientID Category Date
Person1 CategoryA 2020-09-01
Person1 CategoryA 2020-09-30
Person2 CategoryA 2020-07-25
Person2 CategoryA 2020-08-31
Person1 CategoryB 2020-03-15
Person1 CategoryB 2020-09-14
Person2 CategoryB 2020-06-17
Person2 CategoryB 2020-10-10

What I would like to do is filter it to only return the row of the most recent date in each category for every client, so it would look like this:

ClientID Category Date
Person1 CategoryA 2020-09-30
Person1 CategoryB 2020-09-14
Person2 CategoryA 2020-08-31
Person2 CategoryB 2020-10-10

I have tried

library(plyr)

dataSet %>% filter(Category == "CategoryA",Date == max(Date))

I knew this wouldn't work the moment I typed it, but I am lost on where to go. I have considered subsetting the data on the different categories (there are only 4), but then I am still lost with filtering by max date per client (because at least then, I could rbind() the results of each subset into a final data table). But alas, I am stuck.

Thanks in advance for the help.

3 Answers3

2

I guess you can try

df %>%
  group_by(ClientID, Category) %>%
  filter(Date == max(Date))

which gives

  ClientID Category  Date
  <chr>    <chr>     <date>
1 Person1  CategoryA 2020-09-30
2 Person2  CategoryA 2020-08-31
3 Person1  CategoryB 2020-09-14
4 Person2  CategoryB 2020-10-10

A base R option using subset + ave

subset(
  df,
  Date == ave(Date, ClientID, Category, FUN = max)
)

gives

  ClientID  Category       Date
2  Person1 CategoryA 2020-09-30
4  Person2 CategoryA 2020-08-31
6  Person1 CategoryB 2020-09-14
8  Person2 CategoryB 2020-10-10

A data.table option

> setDT(df)[, lapply(.SD, max), by = .(ClientID, Category)]
   ClientID  Category       Date
1:  Person1 CategoryA 2020-09-30
2:  Person2 CategoryA 2020-08-31
3:  Person1 CategoryB 2020-09-14
4:  Person2 CategoryB 2020-10-10

Data

> dput(df)
structure(list(ClientID = c("Person1", "Person1", "Person2", 
"Person2", "Person1", "Person1", "Person2", "Person2"), Category = c("CategoryA",
"CategoryA", "CategoryA", "CategoryA", "CategoryB", "CategoryB",
"CategoryB", "CategoryB"), Date = structure(c(18506, 18535, 18468,
18505, 18336, 18519, 18430, 18545), class = "Date")), row.names = c(NA,
-8L), class = "data.frame")
ThomasIsCoding
  • 53,240
  • 4
  • 13
  • 45
1

Try

library(dplyr)

dataSet %>% 
   group_by(Client_ID, Category) %>% 
   mutate(max_date=max(Date)) %>% 
   filter(Date==max_date)
coffeinjunky
  • 9,583
  • 32
  • 50
1

Another option is use of dplyr::slice. Thanks user ThomasIsCoding for the sample data.

I don't say it's better than the filter option, but it's just another way. One could also make this shorter by using which.max(Date) within slice, instead of arranging first.

library(dplyr)

foodf <- structure(list(ClientID = c("Person1", "Person1", "Person2", 
                            "Person2", "Person1", "Person1", "Person2", "Person2"), Category = c("CategoryA",
                                                                                                 "CategoryA", "CategoryA", "CategoryA", "CategoryB", "CategoryB",
                                                                                                 "CategoryB", "CategoryB"), Date = structure(c(18506, 18535, 18468,
                                                                                                                                               18505, 18336, 18519, 18430, 18545), class = "Date")), row.names = c(NA,
                                                                                                                                                                                                                   -8L), class = "data.frame")
foodf %>% 
  arrange(ClientID, Category, Date) %>%
  group_by(ClientID, Category) %>%
  slice(n())
#> # A tibble: 4 x 3
#> # Groups:   ClientID, Category [4]
#>   ClientID Category  Date      
#>   <chr>    <chr>     <date>    
#> 1 Person1  CategoryA 2020-09-30
#> 2 Person1  CategoryB 2020-09-14
#> 3 Person2  CategoryA 2020-08-31
#> 4 Person2  CategoryB 2020-10-10
tjebo
  • 12,885
  • 4
  • 34
  • 61