I am trying to filter a data frame that contains many rows for each of several categories. Within each category (column `dimension`), I want to sort the rows by another column, `revenues`, keep the top 10 values, and get rid of the rest.

I made an attempt with the following snippet, but it does not seem to achieve what I want:

data <-  tbl_df(data) %>%
  arrange(revenues) %>%
  group_by(dimension) %>%
  • Could you share the source of the error? Otherwise we won't be able to help you – linog Apr 10 '20 at 11:38
  • I don't get an error. I am just not getting the desired output. – sanna Apr 10 '20 at 11:42
  • Does this answer your question? [Getting the top values by group](https://stackoverflow.com/questions/27766054/getting-the-top-values-by-group) – DJJ Apr 10 '20 at 15:23
  • It seems to be a [duplicate](https://stackoverflow.com/questions/27766054/getting-the-top-values-by-group). When posting a question I would recommend avoiding putting `dplyr` at the front of the title unless you are a master of `dplyr`. It might save you having to write the question entirely. – DJJ Apr 10 '20 at 15:30

3 Answers

data <-  tbl_df(data) %>%
  group_by(dimension) %>%
  arrange(revenues, .by_group = TRUE) %>%
  • You need to specify which column `top_n()` should order by, or it will default to the last variable in the data. – Darren Tsai Apr 10 '20 at 12:04

data <-  tbl_df(data) %>%
  group_by(dimension) %>%
  arrange(desc(revenues),.by_group=TRUE) %>%
  • Welcome to Stack Overflow, and thanks for sharing your answer. To make it complete, please add some explanation; feel free to identify what the issue was with the original post.
  • Thank you. I shall keep in mind to add the necessary explanation going forward.

We can test it with some example data:

data = data.frame(revenues=rnbinom(100,mu=1000,size=1),

First, as @DarrenTsai correctly pointed out, you need to specify the column that `top_n()` should rank by. Second, `top_n()` ranks in descending order and keeps the entries with ranks 1 to 10, that is, the 10 largest values. The rows it returns stay in their original order:

data %>% top_n(10,revenues)
   revenues dimension
1      4191         b
2      1916         a
3      2397         b
4      1895         a
5      2013         a
6      2351         b
7      3889         b
8      2503         a
9      3909         a
10     2779         b
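
To illustrate the first point, here is a minimal sketch. The toy data frame is hypothetical (it is not the data from the question), with `dimension` deliberately placed as the last column. When no ranking column is given, `top_n()` falls back to the last column, so within each group every row ties and nothing is filtered out.

```r
library(dplyr)

# Hypothetical toy data; `dimension` is the last column on purpose.
toy <- data.frame(
  revenues  = c(10, 50, 30, 20, 40, 60),
  dimension = c("a", "a", "a", "b", "b", "b")
)

# No ranking column given: top_n() reports which column it selected by
# (the last one, `dimension`). All rows within a group tie, so all are kept.
no_column <- toy %>% group_by(dimension) %>% top_n(2)

# Ranking column given explicitly: the two largest revenues per group.
with_column <- toy %>% group_by(dimension) %>% top_n(2, revenues)
```

Comparing `nrow(no_column)` with `nrow(with_column)` shows that only the second call actually filters.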

Because `top_n()` does its own ranking, you don't need to arrange your data first. I am not sure whether you want the largest or the smallest values; let's assume the largest:

data %>%  group_by(dimension) %>% top_n(10,revenues)
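
If the smallest values were wanted instead, a negative `n` makes `top_n()` select from the bottom. A minimal sketch on hypothetical toy data (the values are made up for illustration):

```r
library(dplyr)

# Hypothetical toy data, not the data from the question.
toy <- data.frame(
  revenues  = c(10, 50, 30, 20, 40, 60),
  dimension = c("a", "a", "a", "b", "b", "b")
)

# Two largest revenues per group.
largest <- toy %>% group_by(dimension) %>% top_n(2, revenues)

# Two smallest revenues per group: a negative n selects from the bottom.
smallest <- toy %>% group_by(dimension) %>% top_n(-2, revenues)
```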

Note that `top_n(10, revenues)` keeps every row ranked in the top 10, so if there are ties at the cut-off (say, two rows tied for 10th place), you will get more than 10 rows per group. For example, in this data:

# A tibble: 21 x 2
# Groups:   dimension [2]
   revenues dimension
      <dbl> <fct>    
 1     1663 a        
 2     1663 a        
 3     1753 a        
 4     1849 a        
 5     1856 a        
 6     1869 a        
 7     1895 a        
 8     1916 a        
 9     2013 a        
10     2503 a        
# … with 11 more rows

We can check whether the results are correct. This is what we expect:

  a1   a2   a3   a4   a5   a6   a7   a8   a9  a10   b1   b2   b3   b4   b5   b6 
3909 2503 2013 1916 1895 1869 1856 1849 1753 1663 4191 3889 2779 2397 2351 1479 
  b7   b8   b9  b10 
1414 1340 1327 1274 

And using `group_by()` + `top_n()`:

data %>% group_by(dimension) %>% top_n(10, revenues) %>%
  arrange(dimension, desc(revenues)) %>% pull(revenues)
 [1] 3909 2503 2013 1916 1895 1869 1856 1849 1753 1663 1663 4191 3889 2779 2397
[16] 2351 1479 1414 1340 1327 1274

You can see 1663 is taken twice, giving 21 values in total.
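
The tie behaviour can be reproduced on a tiny scale. In this sketch (hypothetical toy values), two rows tie at the cut-off rank, so asking for the top 2 returns 3 rows. `min_rank()` is called here only to show the ranks that cause this.

```r
library(dplyr)

# Hypothetical toy data with a tie at the cut-off: two rows share 30.
toy <- data.frame(
  revenues  = c(50, 30, 30, 10),
  dimension = c("a", "a", "a", "a")
)

# Ranks in descending order are 1, 2, 2, 4: both 30s get rank 2.
ranks <- min_rank(desc(toy$revenues))

# Every row with rank <= 2 is kept, so "top 2" yields 3 rows here.
tied <- toy %>% group_by(dimension) %>% top_n(2, revenues)
```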

If you need exactly 20 rows (10 per group):

data %>% arrange(desc(revenues)) %>%
  group_by(dimension) %>% do(head(., 10))
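
In dplyr 1.0.0 and later, `top_n()` is superseded by `slice_max()` and `slice_min()`, and `with_ties = FALSE` caps each group at exactly `n` rows. This is an alternative to the `do(head())` approach, not part of the original answer. A minimal sketch, assuming a recent dplyr and hypothetical toy data:

```r
library(dplyr)

# Hypothetical toy data; group "a" has a tie at the cut-off (two 30s).
toy <- data.frame(
  revenues  = c(50, 30, 30, 10, 60, 40, 20),
  dimension = c("a", "a", "a", "a", "b", "b", "b")
)

# with_ties = FALSE drops surplus tied rows, so each group has exactly 2.
exact <- toy %>%
  group_by(dimension) %>%
  slice_max(revenues, n = 2, with_ties = FALSE) %>%
  ungroup()
```

With the default `with_ties = TRUE`, group "a" would return 3 rows, matching the `top_n()` behaviour described above.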
