0

I want to group by data set based on some IDs, then leave the grouped data that has largest value in the column. Here is a description of my data set.

   BSTN ASTN1 BSTN2 ASTN2 BSTN3 ASTN3 BSTN4 ASTN4 BSTN5  ASTN TRNID TRNID2 TRNID3 TRNID4 TRNID5 count
1   150     0     0     0     0     0     0     0     0   152  1674      0      0      0      0     1
2   150     0     0     0     0     0     0     0     0   152  1676      0      0      0      0     2
3   150     0     0     0     0     0     0     0     0   152  1678      0      0      0      0     2
4   150     0     0     0     0     0     0     0     0   152  1680      0      0      0      0    13
5   150     0     0     0     0     0     0     0     0   152  1682      0      0      0      0     3
6   150     0     0     0     0     0     0     0     0   152  1684      0      0      0      0     4

I want to group and summarise this data into a single row based on IDs the first 10 columns BSTN ASTN1 BSTN2 ASTN2 BSTN3 ASTN3 BSTN4 ASTN4 BSTN5 ASTN.
Then for the rest of the columns, TRNID TRNID2 TRNID3 TRNID4 TRNID5 I would like to replace them with the row with maximum value in column count.

What I want as my final output would look as below.

BSTN ASTN1 BSTN2 ASTN2 BSTN3 ASTN3 BSTN4 ASTN4 BSTN5  ASTN TRNID TRNID2 TRNID3 TRNID4 TRNID5 count
 150   0     0     0     0     0     0     0     0    152  1680     0      0      0      0    13

How would summarise my data? I have 2,931,959 rows with more groups of BSTN, ASTNs.

dput(head(A_Routetable2))
structure(list(BSTN = c(150, 150, 150, 150, 150, 150), ASTN1 = c(0, 
0, 0, 0, 0, 0), BSTN2 = c(0, 0, 0, 0, 0, 0), ASTN2 = c(0, 0, 
0, 0, 0, 0), BSTN3 = c(0, 0, 0, 0, 0, 0), ASTN3 = c(0, 0, 0, 
0, 0, 0), BSTN4 = c(0, 0, 0, 0, 0, 0), ASTN4 = c(0, 0, 0, 0, 
0, 0), BSTN5 = c(0, 0, 0, 0, 0, 0), ASTN = c(152, 152, 152, 152, 
152, 152), TRNID = c(1674, 1676, 1678, 1680, 1682, 1684), TRNID2 = c(0, 
0, 0, 0, 0, 0), TRNID3 = c(0, 0, 0, 0, 0, 0), TRNID4 = c(0, 0, 
0, 0, 0, 0), TRNID5 = c(0, 0, 0, 0, 0, 0), count = c(1L, 2L, 
2L, 13L, 3L, 4L)), row.names = c(NA, -6L), groups = structure(list(
    BSTN = c(150, 150, 150, 150, 150, 150), ASTN1 = c(0, 0, 0, 
    0, 0, 0), BSTN2 = c(0, 0, 0, 0, 0, 0), ASTN2 = c(0, 0, 0, 
    0, 0, 0), BSTN3 = c(0, 0, 0, 0, 0, 0), ASTN3 = c(0, 0, 0, 
    0, 0, 0), BSTN4 = c(0, 0, 0, 0, 0, 0), ASTN4 = c(0, 0, 0, 
    0, 0, 0), BSTN5 = c(0, 0, 0, 0, 0, 0), ASTN = c(152, 152, 
    152, 152, 152, 152), TRNID = c(1674, 1676, 1678, 1680, 1682, 
    1684), TRNID2 = c(0, 0, 0, 0, 0, 0), TRNID3 = c(0, 0, 0, 
    0, 0, 0), TRNID4 = c(0, 0, 0, 0, 0, 0), .rows = structure(list(
        1L, 2L, 3L, 4L, 5L, 6L), ptype = integer(0), class = c("vctrs_list_of", 
    "vctrs_vctr", "list"))), row.names = c(NA, 6L), class = c("tbl_df", 
"tbl", "data.frame"), .drop = TRUE), class = c("grouped_df", 
"tbl_df", "tbl", "data.frame"))

1 Answers1

2

You can group_by position and then select row with max value in count.

library(dplyr)
df %>% group_by(across(1:10)) %>% slice(which.max(count))

#   BSTN ASTN1 BSTN2 ASTN2 BSTN3 ASTN3 BSTN4 ASTN4 BSTN5  ASTN TRNID TRNID2 TRNID3 TRNID4 TRNID5 count
#  <int> <int> <int> <int> <int> <int> <int> <int> <int> <int> <int>  <int>  <int>  <int>  <int> <int>
#1   150     0     0     0     0     0     0     0     0   152  1680      0      0      0      0    13

Or group by range of columns

df %>% group_by(across(BSTN:ASTN)) %>%slice(which.max(count))

The dput shared by OP is grouped which results an error with across. We can ungroup the data first and run the above which runs without any error. However functions in the previous version of dplyr work without any error on it. For example - group_by_at

A_Routetable2 %>% group_by_at(1:10) %>% slice(which.max(count))
Ronak Shah
  • 286,338
  • 16
  • 97
  • 143
  • Thank you Ronak for so much help, I remember that you have answered this problem before but I couldn't solve it yet. – Yun Hyunsoo Jun 30 '20 at 06:34
  • A_Routetable2 %>% group_by(across(1:10)) %>% slice(which.max(count)) error: Problem with `mutate()` input `..1`. x Can't subset columns that don't exist. x Locations 3, 4, 5, 6, 7, etc. don't exist. i There are only 2 columns. i Input `..1` is `across(1:10)`. i The error occured in group 1: BSTN = 150, ASTN1 = 0, BSTN2 = 0, ASTN2 = 0, BSTN3 = 0, ASTN3 = 0, BSTN4 = 0, ASTN4 = 0, BSTN5 = 0, ASTN = 152, TRNID = 1674, TRNID2 = 0, TRNID3 = 0, TRNID4 = 0. – Yun Hyunsoo Jun 30 '20 at 06:35
  • @YunHyunsoo The error message states that you have only 2 columns in `A_Routetable2`. What is `ncol(A_Routetable2)` ? Are you sure you have the same data as shown? Maybe share your data with `dput(head(A_Routetable2))` ? – Ronak Shah Jun 30 '20 at 06:37
  • it is 16. The data described are first 6 columns of the original dataset – Yun Hyunsoo Jun 30 '20 at 06:39
  • maybe it is because some groups of the IDs in BSTN~ASTN have the same "count"? – Yun Hyunsoo Jun 30 '20 at 06:42
  • No, it does not matter if the `count` is same. This will select `max column for each group irrespective. Can you add data using `dput(head(A_Routetable2))` ? – Ronak Shah Jun 30 '20 at 06:48
  • I have added them in my edits thank you – Yun Hyunsoo Jun 30 '20 at 06:50
  • 1
    @YunHyunsoo your data is already grouped, we need to ungroup it first so try `A_Routetable2 %>% ungroup %>% group_by(across(1:10)) %>% slice(which.max(count))` which works for your data. – Ronak Shah Jun 30 '20 at 06:51