2

I have a data frame with just one column, and I want to find the three largest values along with their indices. For example, my data frame df looks like:

  distance
1 1
2 4
3 2
4 3
5 4
6 5
7 5

I want the three largest distinct values together with their indices (ties included), so my expected result is:

  distance
6 5
7 5
2 4
5 4
4 3

How can I do this? Since I have just one column, is this also possible with a list instead of a data frame?

xirururu

7 Answers

8

We can use sort with index.return=TRUE to return the values together with their indices in a list. We can then subset each element of that list to the positions whose value is among the first 3 unique elements of 'x'.

lst <- sort(df$distance, index.return=TRUE, decreasing=TRUE)
lapply(lst, `[`, lst$x %in% head(unique(lst$x),3))
#$x
#[1] 5 5 4 4 3

#$ix
#[1] 6 7 2 5 4
akrun
  • Thanks very much for the answer. But I don't know in advance how many values will be returned. It may be 5 or 4 or 3.... – xirururu Sep 14 '15 at 13:18
  • Hi akrun, I noticed that you use `[` in lapply. What does the `[` mean? – xirururu Sep 14 '15 at 13:40
  • @xirururu It just subsets each list element based on the index returned from `lst$x %in% head(unique..`, without using an anonymous function. It can otherwise be written as `lapply(lst, function(y) y[lst$x %in% head(unique(lst$x),3)])` – akrun Sep 14 '15 at 13:42
  • @xirururu You can find more info from `?Extract` or `?"["` – akrun Sep 14 '15 at 13:43
  • Hi akrun, thanks very much! :D I am now on the `?Extract` page. It is really cool, I can learn so much just from a small question. :D – xirururu Sep 14 '15 at 13:45
  • @xirururu Glad to know that it helped. BTW, I used `sort` with `index.return` because it is more specific than depending on the numeric row names. – akrun Sep 14 '15 at 13:46
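
The point made in the comments above, that `[` is an ordinary function which lapply can call directly, can be illustrated with a minimal sketch. The list `lst` below is typed in by hand to mimic the sorted output for the question's data, and `keep` is a name introduced only for this illustration:

```r
# Hand-written stand-in for the sorted list (values and their original positions)
lst <- list(x  = c(5, 5, 4, 4, 3, 2, 1),
            ix = c(6L, 7L, 2L, 5L, 4L, 3L, 1L))

# TRUE for positions whose value is among the first 3 unique values
keep <- lst$x %in% head(unique(lst$x), 3)

a <- lapply(lst, `[`, keep)            # `[` passed as a function, keep as its index
b <- lapply(lst, function(y) y[keep])  # the same thing with an anonymous function

identical(a, b)   # TRUE
`[`(lst$x, 1:2)   # equivalent to lst$x[1:2]
```

Both calls return the same list; the backtick form is only a shorter way of writing the anonymous function.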
2

A little clumsy version of my previous code:

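 # keep the rows whose value is among the 3 largest distinct values
 top3 <- head(sort(unique(df$distance), decreasing = TRUE), 3)
 ord <- order(df$distance, decreasing = TRUE)
 df[ord[df$distance[ord] %in% top3], , drop = FALSE]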
  distance
6        5
7        5
2        4
5        4
4        3
SabDeM
1
df[order(df$distance, decreasing=TRUE)[1:3],,drop=FALSE]

More generally, for a data frame with several columns, order by the column of interest:

 df[order(df$column_name, decreasing=TRUE)[1:3],,drop=FALSE]
Theodor
  • Hi Theodor, thanks for the answer, but I got the result 5, 5, 4. Actually, I want the top 3 distinct values, so the result should be 5, 5, 4, 4, 3. Do you know how I can do this? – xirururu Sep 14 '15 at 13:24
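
The difference raised in this comment, top 3 rows versus all rows holding the 3 largest distinct values, can be sketched in base R as follows. This is only an illustration under the question's data; `top_distinct` is a hypothetical helper name, not a function from any package:

```r
# Hypothetical helper: values and positions of the n largest distinct values
top_distinct <- function(x, n = 3) {
  cutoff <- head(sort(unique(x), decreasing = TRUE), n)  # n largest distinct values
  ix <- order(x, decreasing = TRUE)                      # positions, largest value first
  ix <- ix[x[ix] %in% cutoff]                            # keep every tie of those values
  list(x = x[ix], ix = ix)
}

distance <- c(1, 4, 2, 3, 4, 5, 5)
res <- top_distinct(distance, 3)
res$x    # 5 5 4 4 3
res$ix   # the positions of those values in distance

# The same positions select the rows of a one-column data frame
df <- data.frame(distance = distance)
top_rows <- df[res$ix, , drop = FALSE]
```

Because the helper works on a plain vector, it also covers the case where the data is not held in a data frame at all.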
1

If you want to sort one column in decreasing order and get the row names of the top N values:

rn <- rownames(df)
indexes <- order(df$ColumnName, decreasing = TRUE)[1:N]

# row names of the N largest values, largest first
result <- rn[indexes]

result

Here, the row names of the top N rows are saved in the 'result' vector, while 'indexes' holds their positions.
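
Row names and positions are only the same thing when the row names are the default 1, 2, 3, and so on. A small sketch with made-up letter row names (an assumption for illustration, not part of the question's data frame) shows the distinction:

```r
# Same distances as in the question, but with letters as row names
df <- data.frame(distance = c(1, 4, 2, 3, 4, 5, 5),
                 row.names = letters[1:7])
N <- 2

indexes <- order(df$distance, decreasing = TRUE)[1:N]  # numeric positions
result  <- rownames(df)[indexes]                       # the matching row names

indexes   # positions 6 and 7 hold the two largest values
result    # rows "f" and "g"
```

If numeric positions are what is needed later, keep `indexes`; if labels are needed, keep `result`.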

1

Using the data.table library can be a faster solution, because in the benchmark below setorder is faster than order and sort:

library(data.table)

select_top_n<-function(scores,n_top){
    d <- data.frame(
        x   = copy(scores),
        indice=seq(1,length(scores)))
    
    setDT(d)
    setorder(d,-x)
    n_top_indice<-d$indice[1:n_top]
    return(n_top_indice)
}


select_top_n2<-function(scores,n_top){
    
    n_top_indice<-order(-scores)[1:n_top]
    return(n_top_indice)
}

select_top_n3<-function(scores,n_top){
    
    n_top_indice<-sort(scores, index.return=TRUE, decreasing=TRUE)$ix[1:n_top]
    return(n_top_indice)
}

Testing:

set.seed(123)
s=runif(100000)

library(microbenchmark)
mbm<-microbenchmark(
    ind1 = select_top_n(s,100),
    ind2=select_top_n2(s,100),
    ind3=select_top_n3(s,100),
    times = 10L
)

Output:

Unit: milliseconds
 expr       min       lq      mean    median        uq       max neval
 ind1  5.824576  5.98959  6.209746  6.052658  6.270312  7.422736    10
 ind2  9.627950 10.08661 10.274867 10.377451 10.560912 10.588223    10
 ind3 10.397383 11.32129 12.087122 12.498817 12.856840 13.155845    10

Refer to Getting the top values by group

Minstein
1

You can use the nth function from the Rfast package to get either the indices or the values:

> x=runif(100000)
> num.of.nths <- 3
> Rfast2::benchmark(a<-Rfast::nth(x,3,num.of.nths,TRUE,TRUE),b<-order(x,decreasing = T)[1:3],times = 10)
   milliseconds 
                                        min     mean     max
a <- Rfast::nth(x, 3, 3, TRUE, TRUE) 1.6483  2.12419  3.1238
b <- order(x, decreasing = T)[1:3]   6.8648 12.31633 27.1988
> 
> a
      [,1]
[1,]  8058
[2,] 63946
[3,] 17556
> b
[1]  8058 63946 17556
Manos Papadakis
0

Get the top percentage (proportion) of rows, ranked by any column:

library(dplyr)

df <- df %>% slice_max(IndexCol, prop = .25)

or by a group

df <- df %>% group_by(col1, col2) %>% slice_max(IndexCol, prop = .25)

https://dplyr.tidyverse.org/reference/slice.html