Select rows with maximum value within each group in R

Question

I would like to extract the rows with pre.post == "pre" and the maximum value on "year" within each participant (ID) in a data frame.

My data looks like this:

df <- read.table(text="pre.post     ID  year
                      post        800033  2011 
                       pre        800033  2009 
                      post        800033  2011 
                       pre        800033  2010 
                      post        800076  2009 
                       pre        800076  2008 
                      post        800076  2009 
                       pre        800076  2007 
                       pre        800100  2018 
                      post        800125  2009 
                       pre        800125  2008 
                      post        800222  2010 
                      post        800223  2013 
                      pre         800223  2011", header = T)

In the final data set I wish to have only one row per person. This row should satisfy pre.post == "pre" and the value of year should be the maximum value for year within this person.

For example, for ID == 800033, I aim to have only row for (with year == 2010).

Individuals without any pre-value (e.g. ID == 800222) should be deleted in the final data set. Individuals with only one pre-value (e.g. ID == 800223 and ID == 800100) the only available row should be kept.

The final data set should look like this:

df.1 <- read.table(text="pre.post     ID  year pre.post outcome1 outcome2

                 1                 pre 800033  2010 A           3  4
                 2                 pre 800076  2008 B           2  7
                 3                 pre 800100  2018 C           3  4
                 4                 pre 800125  2008 A           2  8
                 5                 pre 800223  2011 C           4  NA", header = T)

I already tried this code (using dplyr):

df %>% group_by(ID) %>% filter(pre.post == "pre") %>% summarise(year = max(year))

However, I then lose the other variables and my new data set just includes ID and year.

Do you have advice on how to get this done?

I'd begin here df[df[,1] == "pre",] and you could amend with Boolean code but ... I'm not sure exactly what your outcome row looks like from your description? run the code presented and tell us what you want in the next step. — user1945827, May 08 '19 at 09:00
I edited my question. It now shows how my final data set should look like. — Marie B., May 08 '19 at 09:13
@akrun I'm not targeting posts of specific users. So, stop wining about that and also stop linking to posts that are not relevant for this question. — Jaap, May 08 '19 at 09:27
@Jaap: I'm sorry, but I studied the other posts already and did not find a solution for my problem. Would you be willing to point me into the right direction where to look? Your help is highly appreciated! — Marie B., May 08 '19 at 09:28
you probably need `df %>% filter(pre.post == "pre") %>% group_by(ID) %>% slice(which.max(year))` I don't know how you are getting `outcome1` and `outcome2` in the expected output. — Ronak Shah, May 08 '19 at 09:31
@Ronak Shah: Thank you! Your solution using "slice(which.max(year))" did the trick! Thanks a lot! — Marie B., May 08 '19 at 09:35
@MarieB. The issue was in the way you described the problem. Now you can check my post — akrun, May 08 '19 at 09:38

akrun · Accepted Answer · 2019-05-08T09:45:44.107

2

One option would be

library(dplyr)
df %>% 
  filter(pre.post == "pre") %>%
  group_by(ID) %>%
  filter(year == max(year))

edited May 08 '19 at 09:45

answered May 08 '19 at 08:57

akrun

674,427
24
381
486

I tried this. However, I then receive a data frame with just one row (the one with year == 2018, as this was the overall maximum value on the variable year). – Marie B. May 08 '19 at 09:05
@MarieB. Your desription is `I would like to extract the rows with pre.post == "pre" and the maximum value on "year" within each participant (ID) in a data frame.` – akrun May 08 '19 at 09:06
In the final data set I wish to have only one row per person. This row should satisfy `pre.post == "pre"` and the value of year should be the maximum value for year within this person. – Marie B. May 08 '19 at 09:09

Select rows with maximum value within each group in R

1 Answers1