196

Suppose I have a pandas DataFrame like this:

>>> df = pd.DataFrame({'id':[1,1,1,2,2,2,2,3,4],'value':[1,2,3,1,2,3,4,1,1]})
>>> df
   id  value
0   1      1
1   1      2
2   1      3
3   2      1
4   2      2
5   2      3
6   2      4
7   3      1
8   4      1

I want to get a new DataFrame with the top 2 records for each id, like this:

   id  value
0   1      1
1   1      2
3   2      1
4   2      2
7   3      1
8   4      1

I can do it by numbering the records within each group after a group by:

>>> dfN = df.groupby('id').apply(lambda x:x['value'].reset_index()).reset_index()
>>> dfN
   id  level_1  index  value
0   1        0      0      1
1   1        1      1      2
2   1        2      2      3
3   2        0      3      1
4   2        1      4      2
5   2        2      5      3
6   2        3      6      4
7   3        0      7      1
8   4        0      8      1
>>> dfN[dfN['level_1'] <= 1][['id', 'value']]
   id  value
0   1      1
1   1      2
3   2      1
4   2      2
7   3      1
8   4      1

But is there a more effective/elegant approach to do this? And is there a more elegant way to number records within each group (like the SQL window function row_number())?

Roman Pekar
  • Possible duplicate of [Pandas dataframe get first row of each group](http://stackoverflow.com/questions/20067636/pandas-dataframe-get-first-row-of-each-group) – ssoler Sep 26 '16 at 15:51
  • "top-n" doesn't mean "the n topmost/first/head rows", like you're looking for! It means "the n rows with the largest values". – smci May 13 '18 at 11:50

4 Answers

227

Did you try `df.groupby('id').head(2)`?

Output generated:

>>> df.groupby('id').head(2)
       id  value
id             
1  0   1      1
   1   1      2 
2  3   2      1
   4   2      2
3  7   3      1
4  8   4      1

(Keep in mind that you might need to order/sort first, depending on your data; see the sketch after the output below.)

EDIT: As mentioned by the questioner, use `df.groupby('id').head(2).reset_index(drop=True)` to remove the MultiIndex and flatten the result.

>>> df.groupby('id').head(2).reset_index(drop=True)
    id  value
0   1      1
1   1      2
2   2      1
3   2      2
4   3      1
5   4      1
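
As noted in the sort caveat above, sort first if you want a specific "top" rather than just the first rows that happen to appear in each group. A minimal sketch, assuming you want the 2 smallest values per `id` (the column names follow the question's example):

import pandas as pd

df = pd.DataFrame({'id': [1, 1, 1, 2, 2, 2, 2, 3, 4],
                   'value': [1, 2, 3, 1, 2, 3, 4, 1, 1]})

# Sort the whole frame first, then keep the first 2 rows of each group.
top2 = (df.sort_values(['id', 'value'])
          .groupby('id')
          .head(2)
          .reset_index(drop=True))
print(top2)
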
dorvak
  • Yes, I think that's it. Overlooked this somehow. Do you know a good way to number records within a group? – Roman Pekar Nov 19 '13 at 10:48
  • Numbering in terms of an index? Or as a separate column? Does this one help http://stackoverflow.com/questions/17775935/sql-like-window-functions-in-pandas-row-numbering-in-python-pandas-dataframe ? – dorvak Nov 19 '13 at 10:52
  • @dorvak I thought about a separate column, but the index is usable too – Roman Pekar Nov 19 '13 at 10:52
  • To get the output I need, I also added `.reset_index(drop=True)` – Roman Pekar Nov 19 '13 at 10:57
  • https://github.com/pydata/pandas/pull/5510 was just merged in; it will be in 0.13, a new method called `cumcount` that does exactly this (numbers the records in each group) – Jeff Nov 19 '13 at 11:10
  • @RomanPekar Also, head will be made much faster using this (hopefully 0.13 too) https://github.com/pydata/pandas/pull/5533 – Andy Hayden Nov 19 '13 at 19:43
  • @dorvak don't reset the index, use as_index=False instead. – Andy Hayden Nov 19 '13 at 19:43
  • @AndyHayden This doesn't produce the same output as with reset_index (actually, it didn't change anything; I used `df.groupby('id', as_index=False).head(2)`) – dorvak Nov 20 '13 at 09:07
  • @dorvak That comment was meant for Andy, not for me? – Roman Pekar Nov 20 '13 at 09:09
  • @dorvak oh, it will in 0.13 - there was a bug in apply maybe. – Andy Hayden Nov 20 '13 at 18:15
  • To make @dorvak's answer more complete: if you want the 2 smallest values per `id`, do `df.sort_values(['id', 'value'], axis=0).groupby('id').head(2)`. Another example: the largest value per `id` is given by `df.sort_values(['id', 'value'], axis=0).groupby('id').tail(1)`. – Elmex80s Oct 17 '17 at 13:56
  • As of 0.23.4, there is no need for flattening after `df.groupby('id').head(2)`. [Source](https://pandas.pydata.org/pandas-docs/stable/generated/pandas.core.groupby.GroupBy.head.html#pandas-core-groupby-groupby-head) – rocarvaj Nov 08 '18 at 15:05
  • For `dask` this yields `AttributeError: 'Column not found: head'` – gies0r Aug 09 '20 at 14:28
  • Why must we sort on "id" when it's also (in) the grouping key? – malthe Apr 07 '21 at 18:26
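
Following up on the `cumcount` comment above, a minimal sketch (assuming pandas 0.13 or later) of numbering rows within each group, much like SQL's row_number(), and then keeping the first 2 per id:

import pandas as pd

df = pd.DataFrame({'id': [1, 1, 1, 2, 2, 2, 2, 3, 4],
                   'value': [1, 2, 3, 1, 2, 3, 4, 1, 1]})

# cumcount() numbers the rows of each group starting at 0,
# roughly ROW_NUMBER() OVER (PARTITION BY id) - 1 in SQL terms.
df['rn'] = df.groupby('id').cumcount()

top2 = df[df['rn'] <= 1][['id', 'value']]
print(top2)
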
152

Since 0.14.1, you can now do nlargest and nsmallest on a groupby object:

In [23]: df.groupby('id')['value'].nlargest(2)
Out[23]: 
id   
1   2    3
    1    2
2   6    4
    5    3
3   7    1
4   8    1
dtype: int64

There's a slight weirdness that you get the original index in there as well, but this might be really useful depending on what your original index was.

If you're not interested in it, you can do .reset_index(level=1, drop=True) to get rid of it altogether.

(Note: From 0.17.1 you'll be able to do this on a DataFrameGroupBy too but for now it only works with Series and SeriesGroupBy.)
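
If you want a flat id/value DataFrame back out of that Series, a minimal sketch building on the reset_index note above (it assumes the example df from the question):

import pandas as pd

df = pd.DataFrame({'id': [1, 1, 1, 2, 2, 2, 2, 3, 4],
                   'value': [1, 2, 3, 1, 2, 3, 4, 1, 1]})

# Take the 2 largest values per id, drop the leftover original index,
# then move 'id' from the index back into a regular column.
top2 = (df.groupby('id')['value']
          .nlargest(2)
          .reset_index(level=1, drop=True)
          .reset_index())
print(top2)
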

LondonRob
  • Is there a way to get something like `unique_limit(n)`, i.e. the first n unique values? If I ask for `nlargest` it will sort the whole df, which can be expensive. – citynorman Mar 29 '18 at 00:54
  • This doesn't work for cases when you do an aggregate on the groupby? For instance, `df.groupby([pd.Grouper(freq='M'), 'A'])['B'].count().nlargest(5, 'B')` just returns the overall top 5 in the entire series, not the top 5 per group – geominded Jun 21 '19 at 07:13
  • The statement that this is now also possible on `DataFrameGroupBy`s appears to be false; the linked pull request appears to add `nlargest` to simple `DataFrame`s only. Which is rather unfortunate, because what if you want to select more than one column? – oulenz Oct 23 '19 at 13:53
12

Sometimes sorting the whole dataset up front is very time consuming. We can group by first and take the top k within each group:

topk = 2  # number of rows to keep per group
g = df.groupby(['id']).apply(lambda x: x.nlargest(topk, ['value'])).reset_index(drop=True)
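
The same per-group pattern with `nsmallest` reproduces the output asked for in the question; a minimal sketch, assuming `topk = 2` as above:

import pandas as pd

df = pd.DataFrame({'id': [1, 1, 1, 2, 2, 2, 2, 3, 4],
                   'value': [1, 2, 3, 1, 2, 3, 4, 1, 1]})

topk = 2  # rows to keep per group

# Keep the topk smallest 'value' rows within each id group.
g = (df.groupby('id')
       .apply(lambda x: x.nsmallest(topk, 'value'))
       .reset_index(drop=True))
print(g)
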
Chaffee Chen
1
df.groupby('id').apply(lambda x: x.sort_values(by='value', ascending=False).head(2).reset_index(drop=True))
  • Here, sorting with `ascending=False` gives a result similar to `nlargest`, and `ascending=True` gives a result similar to `nsmallest`.
  • The value passed to `head` plays the same role as the `n` given to `nlargest`: the number of rows to keep for each group.
  • `reset_index` is optional.
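
A minimal usage sketch of the same idea, assuming you prefer a flat index over the per-group index that `apply` produces (the `group_keys=False` argument is my addition, not part of the original answer):

import pandas as pd

df = pd.DataFrame({'id': [1, 1, 1, 2, 2, 2, 2, 3, 4],
                   'value': [1, 2, 3, 1, 2, 3, 4, 1, 1]})

# group_keys=False keeps 'id' out of the resulting index,
# so no MultiIndex flattening is needed afterwards.
top2 = (df.groupby('id', group_keys=False)
          .apply(lambda x: x.sort_values('value', ascending=False).head(2))
          .reset_index(drop=True))
print(top2)
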