Remove duplicates keeping entry with largest absolute value

Question

Let's say I have four samples: id=1, 2, 3, and 4, with one or more measurements on each of those samples:

> a <- data.frame(id=c(1,1,2,2,3,4), value=c(1,2,3,-4,-5,6))
> a
  id value
1  1     1
2  1     2
3  2     3
4  2    -4
5  3    -5
6  4     6

I want to remove duplicates, keeping only one entry per ID - the one having the largest absolute value of the "value" column. I.e., this is what I want:

> a[c(2,4,5,6), ]
  id value
2  1     2
4  2    -4
5  3    -5
6  4     6

How might I do this in R?

You mention "*keeping only one entry per ID - the one having the largest absolute value of the 'value' column.*" What is the desired behavior if more than one entry per ID matches that condition? Return both values, or either one? For example, what's your desired output if `a[3, 2] — A5C1D2H2I1M1N2O1R2T1, Oct 09 '12 at 19:16
Ah.. that's a good question. The value column is a actually real number not an integer, and will very likely never be exactly equal. Ideal desired behavior should probably be to discard both observations, but this probably won't happen as I said. — Stephen Turner, Oct 09 '12 at 21:06

IRTFM · Accepted Answer · 2019-11-20T23:37:12.940

43

First. Sort in the order putting the less desired items last within I’d groups

 aa <- a[order(a$id, -abs(a$value) ), ] #sort by id and reverse of abs(value)

Then: Remove items after the first within I’d groups

 aa[ !duplicated(aa$id), ]              # take the first row within each id
  id value
2  1     2
4  2    -4
5  3    -5
6  4     6

edited Nov 20 '19 at 23:37

answered Oct 09 '12 at 18:24

IRTFM

240,863
19
328
451

score 13 · Answer 2 · edited Oct 09 '12 at 19:12

A data.table approach might be in order if your data set is very large:

library(data.table)

aDT <- as.data.table(a)
setkey(aDT,"id")

aDT[J(unique(id)), list(value = value[which.max(abs(value))])]

Or a not as fast, but still fast, alternative :

library(data.table)
as.data.table(a)[, .SD[which.max(abs(value))], by=id]

This version returns all the columns of a, in case there are more in the real dataset.

nograpes · Answer 3 · 2012-10-10T15:44:14.490

9

Check out ?aggregate:

aggregate(value~id,a,function(x) x[which.max(abs(x))])

I like the answer by @DWin, but I would like show how this could also work with metadata:

aa<-merge(aggregate(value~id,a,function(x) x[which.max(abs(x))]),a)
# Fails if the max value is duplicated for a single id without next line.
aa[!duplicated(aa),]

I couldn't help myself and created one last answer:

do.call(rbind,lapply(split(a,a$id),function(x) x[which.max(abs(x$value)),]))

edited Oct 10 '12 at 15:44

answered Oct 09 '12 at 18:19

nograpes

17,804
1
39
62

This works well as per my description, but I should have been more informative. There is actually a single ID and many other metadata columns that are the same for each ID, and many other value columns for each ID. I want to keep all of the columns in the data frame, not just the one id and value. – Stephen Turner Oct 10 '12 at 12:31

score 9 · Answer 4 · answered May 25 '18 at 10:15

9

Here is a dplyr approach

library(dplyr)
a %>% 
        group_by(id) %>%
        top_n(1, abs(value))

# A tibble: 4 x 2
# Groups:   id [4]
#     id value
#  <dbl> <dbl>
#1     1     2
#2     2    -4
#3     3    -5
#4     4     6

answered May 25 '18 at 10:15

nghauran

6,022
2
14
23

2

Just in case someone else needs to use -1 to get the min instead. `top_n(-1, abs(value))` – Ahdee Feb 15 '20 at 17:52
It's handy, but could someone explain me the logic of this? – Negrito Apr 07 '20 at 08:52
1

@Negrito `top_n()` is wrapper to select the top or bottom entries per group. More details [here](https://dplyr.tidyverse.org/reference/top_n.html). Here, we are interested in the largest (`top_n(n = 1,...)`) absolute value for the "value" column `top_n(..., wt = abs(value))` – nghauran Apr 08 '20 at 07:26

score 5 · Answer 5 · answered Oct 09 '12 at 18:56

5

Another approach (though the code might look a little cumbersome) is to use ave():

a[which(abs(a$value) == ave(a$value, a$id, 
                            FUN=function(x) max(abs(x)))), ]
#   id value
# 2  1     2
# 4  2    -4
# 5  3    -5
# 6  4     6

answered Oct 09 '12 at 18:56

A5C1D2H2I1M1N2O1R2T1

177,446
27
370
450

@DWin, ["I learned it by watching you!"](http://www.youtube.com/watch?v=Y-Elr5K2Vuo). ;) – A5C1D2H2I1M1N2O1R2T1 Oct 10 '12 at 03:17
Heh. Not from my cultural framework, but the Scrubs outtakes I got to after the "brain on drugs" clip at youtube were curiously amusing. – IRTFM Oct 10 '12 at 06:02

score 3 · Answer 6 · answered Oct 09 '12 at 18:21

3

library(plyr)
ddply(a, .(id), function(x) return(x[which(abs(x$value)==max(abs(x$value))),]))

answered Oct 09 '12 at 18:21

Maiasaura

29,590
23
96
103

plyr is terribly slow – chupvl Mar 25 '15 at 13:36

score 1 · Answer 7 · answered Nov 14 '18 at 22:51

1

You can do this with dplyr as follows:

library(dplyr)
a %>%
  group_by(name) %>%
  filter(n == max(n)) %>%
  ungroup()

answered Nov 14 '18 at 22:51

dmca

633
1
8
16

Remove duplicates keeping entry with largest absolute value

7 Answers7

Linked

Related