1

I am trying to subset the maximum (minimum / whatever) value by groups. They are defined by more than one grouping variable.

My working workaround is to unite the grouping columns first (see desired output), but is there a more direct data.table syntax?

This is not an immediate duplicate to the famous questions:

https://stackoverflow.com/a/24558696/7941188 - because asking for grouping by one variable.

How to select the rows with maximum values in each group with dplyr? - because only dplyr solutions offered.

Cheers

library(tidyverse)
library(data.table)

set.seed(1)
mydf <- expand.grid(list(A = 1:5, B = 1:5, C = 1:5))
mydf$value <- runif(nrow(mydf))

mydf %>%
  group_by(A, B) %>%
  filter(value == max(value)) %>%
  arrange(A, B, C)
#> # A tibble: 25 x 4
#> # Groups:   A, B [25]
#>        A     B     C value
#>    <int> <int> <int> <dbl>
#>  1     1     1     4 0.892
#>  2     1     2     1 0.898
#>  3     1     3     5 0.976
#>  4     1     4     2 0.821
#>  5     1     5     5 0.992
#>  6     2     1     4 0.864
#>  7     2     2     1 0.945
#>  8     2     3     2 0.794
#>  9     2     4     1 0.718
#> 10     2     5     3 0.839
#> # … with 15 more rows

Desired output - is there a way to get that without creating the united column first?

mydt <- mydf %>%
  arrange(A,B,C) %>%
  unite("A_B", A, B) %>%
  as.data.table()

mydt[mydt[, .I[value == max(value)], by = A_B]$V1] %>%
  separate(A_B, LETTERS[1:2]) %>% 
  head(10)
#>     A B C     value
#>  1: 1 1 4 0.8921983
#>  2: 1 2 1 0.8983897
#>  3: 1 3 5 0.9761707
#>  4: 1 4 2 0.8209463
#>  5: 1 5 5 0.9918386
#>  6: 2 1 4 0.8643395
#>  7: 2 2 1 0.9446753
#>  8: 2 3 2 0.7942399
#>  9: 2 4 1 0.7176185
#> 10: 2 5 3 0.8394404

Created on 2020-04-21 by the reprex package (v0.3.0)

tjebo
  • 12,885
  • 4
  • 34
  • 61
  • @ronakshah would you care adding your solution to the more prominent thread for better visibility? – tjebo Jan 30 '21 at 12:45

1 Answers1

1

You can compare value with max value in A and B, extract the logical vector and use it to subset data.table.

library(data.table)

setDT(mydf)
mydf[mydf[, value == max(value), .(A, B)]$V1, ]
Ronak Shah
  • 286,338
  • 16
  • 97
  • 143