92

What is the best way to do a groupby on a Pandas dataframe, but exclude some columns from that groupby? e.g. I have the following dataframe:

Code   Country      Item_Code   Item    Ele_Code    Unit    Y1961    Y1962   Y1963
2      Afghanistan  15          Wheat   5312        Ha      10       20      30
2      Afghanistan  25          Maize   5312        Ha      10       20      30
4      Angola       15          Wheat   7312        Ha      30       40      50
4      Angola       25          Maize   7312        Ha      30       40      50

I want to groupby the column Country and Item_Code and only compute the sum of the rows falling under the columns Y1961, Y1962 and Y1963. The resulting dataframe should look like this:

Code   Country      Item_Code   Item    Ele_Code    Unit    Y1961    Y1962   Y1963
2      Afghanistan  15          C3      5312        Ha      20       40       60
4      Angola       25          C4      7312        Ha      60       80      100

Right now I am doing this:

df.groupby('Country').sum()

However this adds up the values in the Item_Code column as well. Is there any way I can specify which columns to include in the sum() operation and which ones to exclude?

karel
  • 3,880
  • 31
  • 37
  • 42
user308827
  • 18,706
  • 61
  • 194
  • 336

3 Answers3

127

You can select the columns of a groupby:

In [11]: df.groupby(['Country', 'Item_Code'])[["Y1961", "Y1962", "Y1963"]].sum()
Out[11]:
                       Y1961  Y1962  Y1963
Country     Item_Code
Afghanistan 15            10     20     30
            25            10     20     30
Angola      15            30     40     50
            25            30     40     50

Note that the list passed must be a subset of the columns otherwise you'll see a KeyError.

Andy Hayden
  • 291,328
  • 80
  • 565
  • 500
  • 1
    How to include count of records for each country & item code as another column? – Sushant Kulkarni Jun 12 '17 at 23:42
  • You could create a dummy column before grouping by that just contains 1. then sum will sum those up creating a count. – Matt W. Feb 13 '18 at 22:36
  • If you just want to exclude a column or two, then you get all the columns names as in `listColumns = list(df.columns)` then you remove the columns you don't want, `listColumns.remove('Y1964')` and finally do your summation: `df.groupby(['Country', 'Item_Code'])[listColumns].sum()` – Roberto Stelling Feb 19 '18 at 15:21
  • Thanks a lot. I can get the groupby to work, but not the selection part. The list of columns I put in are among the dataframe, but it keeps raising ValueError: `cannot reindex from a duplicate axis` – Bowen Liu Oct 31 '18 at 15:34
  • @BowenLiu if you have multiple columns with the same name it'll show this error. In this case you'll have to use iloc to or loc to get the columns you want, I think you'll have to do that prior to the groupby. – Andy Hayden Oct 31 '18 at 16:45
  • Thanks a lot, it is working now as I selected a smaller dataframe with manageable columns. I didn't know that it wouldn't work as long as there are columns with identical names. I thought as long as I don't select these columns, it will be fine. And I'm wondering if you can help me with a question that arose for me in this process: why does `df.groupby({columns1])[columns2].sum()` return a dataframe whereas `df.groupby({columns1])[columns2].size()` will give me a data series as output? Thanks. – Bowen Liu Oct 31 '18 at 17:11
  • @BowenLiu I don't have a good answer for that one. Re the non-selection you might like to file a bug on github. – Andy Hayden Oct 31 '18 at 17:14
  • Thanks, and by "non-selection" do you mean the first part of my comment? I will delete the columns with duplicate names and try it again and see if it's the problem. What shall be the best way to go about it? Can I change column name or delete column using `iloc` as specifying name doesn't work here. Thanks again – Bowen Liu Oct 31 '18 at 17:18
  • @BowenLiu I mean you can slice out the columns you want `df.iloc[:, [0,1, 2, 3]].groupby(...)` where 0, 1, 2 are the columns you want to slice. Renaming so that you don't have an duplicates is also a solution – Andy Hayden Oct 31 '18 at 19:08
  • Thanks Andy. I used your code again today to work on a different project and it works like charm. But when I was playing around with your code, I found that `df.groupby([A, B, C)[[D, E, F]].sum()` seems to give the same result as `df.groupby([A, B, C)[D, E, F].sum()` . Could you tell me why you used two pair of brackets please? Thanks again. – Bowen Liu Dec 06 '18 at 20:44
  • @BowenLiu [[..]] is usually required to slice a subset of columns, that's the case with DataFrames BUT it _looks_ like that's no longer necessary with a DataFrameGroupBy (you can now use either). – Andy Hayden Dec 06 '18 at 23:37
  • Thanks! I thought that probably has something to do with columns slicing. – Bowen Liu Dec 07 '18 at 03:40
41

The agg function will do this for you. Pass the columns and function as a dict with column, output:

df.groupby(['Country', 'Item_Code']).agg({'Y1961': np.sum, 'Y1962': [np.sum, np.mean]})  # Added example for two output columns from a single input column

This will display only the group by columns, and the specified aggregate columns. In this example I included two agg functions applied to 'Y1962'.

To get exactly what you hoped to see, included the other columns in the group by, and apply sums to the Y variables in the frame:

df.groupby(['Code', 'Country', 'Item_Code', 'Item', 'Ele_Code', 'Unit']).agg({'Y1961': np.sum, 'Y1962': np.sum, 'Y1963': np.sum})
leroyJr
  • 1,010
  • 8
  • 16
  • 1
    thanks, can this be generalized? I have lots of columns of the form Y1961... so I generate a list like this: yrs = ['Y' + str(x) for x in range(1961, 2010 + 1, 1)]. Can your solution use 'yrs' inside agg? – user308827 Sep 24 '15 at 00:11
  • I really like this idea. The trick is constructing this dict with the value being the numpy sum function. Conversely though, if all you want to do is sum all of the remaining columns, your original-ish solution would work if all of the group by columns are included in the group by statement. – leroyJr Sep 24 '15 at 00:21
11

If you are looking for a more generalized way to apply to many columns, what you can do is to build a list of column names and pass it as the index of the grouped dataframe. In your case, for example:

columns = ['Y'+str(i) for year in range(1967, 2011)]

df.groupby('Country')[columns].agg('sum')
Richard
  • 44,865
  • 24
  • 144
  • 216
Superstar
  • 309
  • 1
  • 4
  • 7