Picking up only specific columns based on conditions on multiple columns in R

Question

I have a data frame, say

df <- data.frame(x = c(1,2,5,6,3,3,3,6,8,8,8,8),
             y = c(1,1,1,1,1,2,3,1,1,2,3,4),
             z = c("a","b","c","d","e","f","g","h","i","j","k","l"))

it looks like this

I would like pick unique elements from column x, based on column y such that y should be maximum (in this case say for row number 5 to 7 are 3'3, I would like to pick the x = 3 corresponding to y = 3 (maximum value) similarly for x = 8 I d like to pick y = 4 row )

the output should look like this

I have a solution for that, which I am posting in the solution, but if there is there any better method to achieve this, My solution only works in this specific case (picking the largest) what is the general case solution for this?

Ronak Shah · Accepted Answer · 2016-08-30T17:28:51.410

2

One solution using dplyr

library(dplyr) 
df %>% 
 group_by(x) %>% 
 slice(max(y))

#      x     y     z
#  (dbl) (dbl) (chr)
#1     1     1     a
#2     2     1     b
#3     3     3     g
#4     5     1     c
#5     6     1     d
#6     8     4     l

The base R alternative is using aggregate

aggregate(y~x, df, max)

edited Aug 30 '16 at 17:28

answered Aug 30 '16 at 17:23

Ronak Shah

286,338
16
97
143

I would probably `arrange` it too, if you reverse the observations the result will differ. – J.R. Aug 30 '16 at 17:59

score 2 · Answer 2 · answered Aug 30 '16 at 17:25

You can achieve the same result using a dplyr chain and dplyr's group_by function. Once you use a group_by function the rest of the functions in the chain are applied within group as opposed to the whole data.frame. So here I filter to where the only rows left are the max(y) per the grouping value of x. This can be extended to be used for the min of y or a particular value.

I think its generally good practice to ungroup the data at the end of a chain using group_by to avoid any unexpected behavior.

library(dplyr)
df <- data.frame(x = c(1,2,5,6,3,3,3,6,8,8,8,8),
             y = c(1,1,1,1,1,2,3,1,1,2,3,4),
             z = c("a","b","c","d","e","f","g","h","i","j","k","l"))
df %>% 
  group_by(x) %>% 
  filter(y==max(y)) %>% 
  ungroup()

To make it more general... say instead you wanted the mean of y for a given x as opposed to the max. You could then use the summarise function instead of the filter as shown below.

df %>% 
  group_by(x) %>% 
  summarise(y=mean(y)) %>% 
  ungroup()

J.R. · Answer 3 · 2016-08-31T04:46:28.210

1

Using data.table we can use df[order(z), .I[which.max(y)], by = x] to get the rownumbers of interest, eg:

library(data.table)
setDT(df)
df[df[order(z), .I[which.max(y)], by = x][, V1]]

   x y z
1: 1 1 a
2: 2 1 b
3: 5 1 c
4: 6 1 d
5: 3 3 g
6: 8 4 l

edited Aug 31 '16 at 04:46

answered Aug 30 '16 at 18:26

J.R.

3,628
1
16
22

Kou · Answer 4 · 2016-08-30T17:36:03.047

0

Here is my solution using dplyr package

library(dplyr)
df <- data.frame(x = c(1,2,5,6,3,3,3,6,8,8,8,8),
                 y = c(1,1,1,1,1,2,3,1,1,2,3,4),
                 z = c("a","b","c","d","e","f","g","h","i","j","k","l"))
df <- arrange(df,desc(y))
df_out <- df[!duplicated(df$x),]
df_out

Printing df_out

edited Aug 30 '16 at 17:36

answered Aug 30 '16 at 17:13

Kou

145
9

1

`arrange` and `desc` are not R base functions, you should mention the package they belong to. – Jilber Urbina Aug 30 '16 at 17:25
@Rohak, corrected the output ; and Jilber, added package info – Kou Aug 30 '16 at 17:37

score 0 · Answer 5 · edited Sep 15 '17 at 14:28

Assuming the data frame is ordered by df[order(df$x, df$y),] as it is in the example, you can use base R functions, split, lapply, and do.call/rbind to extract your desired rows using the "split / apply / combine" methodology.

do.call(rbind, lapply(split(df, df$x), function(i) i[nrow(i),]))
  x y z
1 1 1 a
2 2 1 b
3 3 3 g
5 5 1 c
6 6 1 h
8 8 4 l

split breaks up the data.frame into a list based on x. This list is fed to lapply which selects the last row of each data.frame, and returns these one row data.frames as a list. This list is then rbinded into a single data frame using do.call.

Picking up only specific columns based on conditions on multiple columns in R

5 Answers5