I am trying to subset the maximum (minimum / whatever) value by groups. They are defined by more than one grouping variable.
My working workaround is to unite the grouping columns first (see desired output), but is there a more direct data.table syntax?
This is not an immediate duplicate to the famous questions:
https://stackoverflow.com/a/24558696/7941188 - because asking for grouping by one variable.
How to select the rows with maximum values in each group with dplyr? - because only dplyr solutions offered.
Cheers
library(tidyverse)
library(data.table)
set.seed(1)
mydf <- expand.grid(list(A = 1:5, B = 1:5, C = 1:5))
mydf$value <- runif(nrow(mydf))
mydf %>%
group_by(A, B) %>%
filter(value == max(value)) %>%
arrange(A, B, C)
#> # A tibble: 25 x 4
#> # Groups: A, B [25]
#> A B C value
#> <int> <int> <int> <dbl>
#> 1 1 1 4 0.892
#> 2 1 2 1 0.898
#> 3 1 3 5 0.976
#> 4 1 4 2 0.821
#> 5 1 5 5 0.992
#> 6 2 1 4 0.864
#> 7 2 2 1 0.945
#> 8 2 3 2 0.794
#> 9 2 4 1 0.718
#> 10 2 5 3 0.839
#> # … with 15 more rows
Desired output - is there a way to get that without creating the united column first?
mydt <- mydf %>%
arrange(A,B,C) %>%
unite("A_B", A, B) %>%
as.data.table()
mydt[mydt[, .I[value == max(value)], by = A_B]$V1] %>%
separate(A_B, LETTERS[1:2]) %>%
head(10)
#> A B C value
#> 1: 1 1 4 0.8921983
#> 2: 1 2 1 0.8983897
#> 3: 1 3 5 0.9761707
#> 4: 1 4 2 0.8209463
#> 5: 1 5 5 0.9918386
#> 6: 2 1 4 0.8643395
#> 7: 2 2 1 0.9446753
#> 8: 2 3 2 0.7942399
#> 9: 2 4 1 0.7176185
#> 10: 2 5 3 0.8394404
Created on 2020-04-21 by the reprex package (v0.3.0)