0

I have a dataframe which is a series of rows containing identifiers of individuals, dates of their appointments and a numerical value.

For every row, I need the sum of the numerical values for that individual over the past year. I effectively want to do:

df.groupby('ID')['numerical value'].cumsum()

but I want the cumsum to include only rows where the date is up to a year in the past.

I'm currently using the following apply function but it is very slow when I have a long dataframe.

def sum_hist(appt_date,personID,key):
    search_limit=appt_date+pd.Timedelta("-365 days")
    temp_df=df[(df["ID number"]==personID)
    & (df["Appointment Date"]>search_limit) & (df["Appointment Date"]<=appt_date)]
    return(sum(temp_df[key].values))

df[key]=df[["Appointment Date","ID number"]].apply(lambda x: sum_hist(*x,key),axis=1)

I'd appreciate any method to do that same thing that is more efficient.

1 Answers1

0

It turns out this can be achieved with the pandas rolling function:

df.groupby('ID number').rolling(window='365d',min_periods=1,on="Appointment Date")['numerical value'].sum()

It greatly improves the speed and makes my problem possible to run in less than a minute rather than the hours it was taking previously.

Thanks @SH-SF for the link to 1 Year Rolling mean pandas on column date