Use pandas to group by column and then create a new column based on a condition

Question

I need to reproduce with pandas what SQL does so easily:

select
    del_month
    , sum(case when off0_on1 = 1 then 1 else 0 end) as on1
    , sum(case when off0_on1 = 0 then 1 else 0 end) as off0
from a1
group by del_month
order by del_month

Here is a sample, illustrative pandas dataframe to work on:

a1 = pd.DataFrame({'del_month':[1,1,1,1,2,2,2,2], 'off0_on1':[0,0,1,1,0,1,1,1]})

Here are my attempts to reproduce the above SQL with pandas. The first line works. The second line gives an error:

a1['on1'] = a1.groupby('del_month')['off0_on1'].transform(sum)
a1['off0'] = a1.groupby('del_month')['off0_on1'].transform(sum(lambda x: 1 if x == 0 else 0))

Here's the second line's error:

TypeError: 'function' object is not iterable

This previous question of mine had a problem with the lambda function, which was solved. The bigger problem is how to reproduce SQL's "sum(case when)" logic on grouped data. I'm looking for a general solution, since I need to do this sort of thing often. The answers in my previous question suggested using map() inside the lambda function, but the following results for the "off0" column are not what I need. The "on1" column is what I want. The answer should be the same for the whole group (i.e. "del_month").

Parfait · Accepted Answer · 2018-11-01T18:55:20.547

Simply sum the Trues in your conditional logic expressions:

import pandas as pd

a1 = pd.DataFrame({'del_month':[1,1,1,1,2,2,2,2], 
                   'off0_on1':[0,0,1,1,0,1,1,1]})

a1['on1'] = a1.groupby('del_month')['off0_on1'].transform(lambda x: sum(x==1))    
a1['off0'] = a1.groupby('del_month')['off0_on1'].transform(lambda x: sum(x==0))

print(a1)    
#    del_month  off0_on1  on1  off0
# 0          1         0    2     2
# 1          1         0    2     2
# 2          1         1    2     2
# 3          1         1    2     2
# 4          2         0    3     1
# 5          2         1    3     1
# 6          2         1    3     1
# 7          2         1    3     1

Similarly, you can do the same in SQL if dialect supports it which most should:

select
    del_month
    , sum(off0_on1 = 1) as on1
    , sum(off0_on1 = 0) as off0
from a1
group by del_month
order by del_month

And to replicate above SQL in pandas, don't use transform but send multiple aggregates in a groupby().apply() call:

def aggfunc(x):
    data = {'on1': sum(x['off0_on1'] == 1),
            'off0': sum(x['off0_on1'] == 0)}

    return pd.Series(data)

g = a1.groupby('del_month').apply(aggfunc)

print(g)    
#            on1  off0
# del_month           
# 1            2     2
# 2            3     1

Beautiful. That's exactly what I was looking for. Thanks so much! Is there now a way of collapsing the "del_month" (as in the SQL example code) without chaining another groupby? — Sean McCarthy, Nov 01 '18 at 18:37
Great to hear! To collapse `del_month`, don't use `transform` (used for inline aggregates) but run multiple aggregates just on `groupby`. — Parfait, Nov 01 '18 at 18:49
would you mind typing out an example for me? I'll up-vote it. Thanks a lot. I'm new to this. :) — Sean McCarthy, Nov 01 '18 at 18:54

cs95 · Answer 2 · 2018-11-01T19:55:54.453

2

Using get_dummies would only need a single groupby call, which is simpler.

v = pd.get_dummies(df.pop('off0_on1')).groupby(df.del_month).transform(sum)
df = pd.concat([df, v.rename({0: 'off0', 1: 'on1'}, axis=1)], axis=1)

df
   del_month  off0  on1
0          1     2    2
1          1     2    2
2          1     2    2
3          1     2    2
4          2     1    3
5          2     1    3
6          2     1    3
7          2     1    3

Additionally, for the case of aggregation, call sum directly instead of using apply:

(pd.get_dummies(df.pop('off0_on1'))
   .groupby(df.del_month)
   .sum()
   .rename({0: 'off0', 1: 'on1'}, axis=1))

           off0  on1
del_month           
1             2    2
2             1    3

edited Nov 01 '18 at 19:55

answered Nov 01 '18 at 19:32

cs95

274,032
76
480
537

Very interesting solution. You're very creative. Not sure if this is quite as generalizable as @Parfait's solution, but I'm definitely going to give it some serious thought. Also, I'm a newb so I can't tell which is better.. :P – Sean McCarthy Nov 01 '18 at 19:35
1

@Sean_Calgary You use `str.get_dummies` to convert the off_on column to a DataFrame of 2 columns of one hot encodings, and then just sum the ones... it's the exact same thing as Parfait but accomplishes it in one groupby. The second line is just doing some housekeeping to get your column names. If you want to figure out which is better, I recommend running both solutions on your data and then using whatever works quicker for you. – cs95 Nov 01 '18 at 19:53
You guys are amazing. That's such an elegant and creative solution. You must have an IQ of 170! I'm not sure I can use pd.get_dummies() in all the situations in which I can use apply(custom_function), but maybe I just need to try it and think about it more. Bravo! – Sean McCarthy Nov 01 '18 at 20:03
@Sean_Calgary Not quite there yet but nonetheless you're welcome. – cs95 Nov 01 '18 at 20:10

Use pandas to group by column and then create a new column based on a condition

2 Answers2

Linked