2

I want to create a column churn as shown. The code should group and compare each year's column Col and assign 0 if it finds Col value in next year.

In this example 3rd row is missing from 2017. Hence assigning 1.

How do I do this in pandas?

State ID    Col   Year  cost  Churn
CT    123   M     2016  10    0
CT    123   C     2016  15    0
CT    123   A     2016  10    1
CT    123   C     2016  20    0
CT    123   M     2017  10    0
CT    123   C     2017  15    0
Priyank
  • 134
  • 9
  • what if a customer leaves and then returns? i.e. A appears in 2015, is missing from 2016, and then appears again in 2017? – Roy2012 Aug 11 '20 at 05:19
  • If A disappear in 2016, then assign 1 in year 2015 because they churned. If they come back in 2017, then assign 0 because thats a new customer they have not churned – Priyank Aug 11 '20 at 05:22

1 Answers1

2

First add all missing combinations of first 4 columns by Series.reindex with MultiIndex.from_product, then shift per first 3 columns by DataFrameGroupBy.shift and last use DataFrame.merge for original order and remove all added rows (if no parameter on it use all columns wich are same in both DataFrames):

s = df.assign(Churn=0).set_index(['State','ID','Col','Year'])['Churn']
df1 = df.merge(s.reindex(pd.MultiIndex.from_product(s.index.levels), fill_value=1)
                .groupby(level=[0,1,2])
                .shift(-1, fill_value=0)
                .reset_index())
print (df1)
  State   ID Col  Year  Churn
0    CT  123   M  2016      0
1    CT  123   C  2016      0
2    CT  123   A  2016      1
3    CT  123   M  2017      0
4    CT  123   C  2017      0
jezrael
  • 629,482
  • 62
  • 918
  • 895
  • is there a way to do it without index, I get error on saying `cannot handle a non-unique multi-index`, also don't understand why are shifting -1. I am trying to compare a combination of columns and column `Col` within each year to decide churn – Priyank Aug 11 '20 at 05:54
  • @Priyank - It means there are duplicates, like first `C` in sample data is `M`, how are processing this data? – jezrael Aug 11 '20 at 05:55
  • I understand there are duplicates but not sure why that is important in calculating `churn` indicator. If a combination repeats within a year it's fine, if it repeats in next year then 0 if it doesn't then 1 in the previous year – Priyank Aug 11 '20 at 06:06
  • @Priyank - yop, agree. There are columns only `['State','ID','Col','Year']` in input data? If yes, then change `s = df.assign(Churn=0).set_index(['State','ID','Col','Year'])['Churn']` to `s = df.drop_duplicates().assign(Churn=0).set_index(['State','ID','Col','Year'])['Churn']` – jezrael Aug 11 '20 at 06:07
  • Actually, there are other column in the data which are relevant and have different values for in same year. I tried `drop_duplicates` gives me same error – Priyank Aug 11 '20 at 06:28
  • @Priyank - ok, then use `s = df.drop_duplicates(subset=['State','ID','Col','Year']).assign(Churn=0).set_index(['State','ID','Col','Year'])['Churn']` – jezrael Aug 11 '20 at 06:28
  • Let us [continue this discussion in chat](https://chat.stackoverflow.com/rooms/219585/discussion-between-priyank-and-jezrael). – Priyank Aug 11 '20 at 06:38