1

I have a data frame, say

df <- data.frame(x = c(1,2,5,6,3,3,3,6,8,8,8,8),
             y = c(1,1,1,1,1,2,3,1,1,2,3,4),
             z = c("a","b","c","d","e","f","g","h","i","j","k","l"))

it looks like this

   x y z
1  1 1 a
2  2 1 b
3  5 1 c
4  6 1 d
5  3 1 e
6  3 2 f
7  3 3 g
8  6 1 h
9  8 1 i
10 8 2 j
11 8 3 k
12 8 4 l

I would like pick unique elements from column x, based on column y such that y should be maximum (in this case say for row number 5 to 7 are 3'3, I would like to pick the x = 3 corresponding to y = 3 (maximum value) similarly for x = 8 I d like to pick y = 4 row )

the output should look like this

  x y z
1 1 1 a
2 2 1 b
3 5 1 c
4 6 1 d
5 3 3 g
6 6 1 h
7 8 4 l

I have a solution for that, which I am posting in the solution, but if there is there any better method to achieve this, My solution only works in this specific case (picking the largest) what is the general case solution for this?

Kou
  • 145
  • 9

5 Answers5

2

One solution using dplyr

library(dplyr) 
df %>% 
 group_by(x) %>% 
 slice(max(y))

#      x     y     z
#  (dbl) (dbl) (chr)
#1     1     1     a
#2     2     1     b
#3     3     3     g
#4     5     1     c
#5     6     1     d
#6     8     4     l

The base R alternative is using aggregate

aggregate(y~x, df, max)
Ronak Shah
  • 286,338
  • 16
  • 97
  • 143
  • I would probably `arrange` it too, if you reverse the observations the result will differ. – J.R. Aug 30 '16 at 17:59
2

You can achieve the same result using a dplyr chain and dplyr's group_by function. Once you use a group_by function the rest of the functions in the chain are applied within group as opposed to the whole data.frame. So here I filter to where the only rows left are the max(y) per the grouping value of x. This can be extended to be used for the min of y or a particular value.

I think its generally good practice to ungroup the data at the end of a chain using group_by to avoid any unexpected behavior.

library(dplyr)
df <- data.frame(x = c(1,2,5,6,3,3,3,6,8,8,8,8),
             y = c(1,1,1,1,1,2,3,1,1,2,3,4),
             z = c("a","b","c","d","e","f","g","h","i","j","k","l"))
df %>% 
  group_by(x) %>% 
  filter(y==max(y)) %>% 
  ungroup()

To make it more general... say instead you wanted the mean of y for a given x as opposed to the max. You could then use the summarise function instead of the filter as shown below.

df %>% 
  group_by(x) %>% 
  summarise(y=mean(y)) %>% 
  ungroup()
Adam Spannbauer
  • 2,397
  • 1
  • 14
  • 25
1

Using data.table we can use df[order(z), .I[which.max(y)], by = x] to get the rownumbers of interest, eg:

library(data.table)
setDT(df)
df[df[order(z), .I[which.max(y)], by = x][, V1]]

   x y z
1: 1 1 a
2: 2 1 b
3: 5 1 c
4: 6 1 d
5: 3 3 g
6: 8 4 l
J.R.
  • 3,628
  • 1
  • 16
  • 22
0

Here is my solution using dplyr package

library(dplyr)
df <- data.frame(x = c(1,2,5,6,3,3,3,6,8,8,8,8),
                 y = c(1,1,1,1,1,2,3,1,1,2,3,4),
                 z = c("a","b","c","d","e","f","g","h","i","j","k","l"))
df <- arrange(df,desc(y))
df_out <- df[!duplicated(df$x),]
df_out

Printing df_out

  x y z
1 8 4 l
2 3 3 g
6 1 1 a
7 2 1 b
8 5 1 c
9 6 1 d
Kou
  • 145
  • 9
0

Assuming the data frame is ordered by df[order(df$x, df$y),] as it is in the example, you can use base R functions, split, lapply, and do.call/rbind to extract your desired rows using the "split / apply / combine" methodology.

do.call(rbind, lapply(split(df, df$x), function(i) i[nrow(i),]))
  x y z
1 1 1 a
2 2 1 b
3 3 3 g
5 5 1 c
6 6 1 h
8 8 4 l

split breaks up the data.frame into a list based on x. This list is fed to lapply which selects the last row of each data.frame, and returns these one row data.frames as a list. This list is then rbinded into a single data frame using do.call.

josliber
  • 41,865
  • 12
  • 88
  • 126
lmo
  • 35,764
  • 9
  • 49
  • 57