Really struggling with this rolling computation... Any assistance much appreciated.
For dataframe:
sale_year seller item_id mean_estimate sale_price deviation status
0 2019 bob 1 20000 11000 -9000 sold
1 2019 alice 2 35000 39000 4000 sold
2 2018 bob 3 15000 17000 2000 not sold
3 2017 alice 4 60000 120000 60000 sold
4 2017 alice 5 50000 80000 30000 sold
5 2017 alice 6 60000 120000 60000 sold
6 2017 alice 7 40000 120000 80000 sold
7 2017 bob 8 20000 27000 7000 sold
8 2017 alice 9 200000 175000 -25000 sold
9 2016 alice 10 100000 150000 50000 sold
10 2015 bob 11 75000 100000 25000 sold
11 2015 alice 12 100000 150000 50000 sold
12 2015 alice 13 10000 15000 5000 sold
13 2015 alice 14 100000 150000 50000 sold
14 2009 alice 15 150000 150000 0 sold
code:
import numpy as np
import pandas as pd

data = {
"sale_year": [2019, 2019, 2018, 2017, 2017, 2017, 2017, 2017, 2017, 2016, 2015, 2015, 2015, 2015, 2009],
"seller": ["bob", "alice", "bob", "alice", "alice", "alice", "alice", "bob", "alice", "alice", "bob", "alice", "alice", "alice", "alice"],
"item_id": [1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15],
"mean_estimate": [20000, 35000, 15000, 60000, 50000, 60000, 40000, 20000, 200000, 100000, 75000, 100000, 10000, 100000, 150000],
"sale_price": [11000, 39000, 17000, 120000, 80000, 120000, 120000, 27000, 175000, 150000, 100000, 150000, 15000, 150000, 150000],
"deviation": [-9000, 4000, 2000, 60000, 30000, 60000, 80000, 7000, -25000, 50000, 25000, 50000, 5000, 50000, 0],
"status": ["sold", "sold", "not sold", "sold", "sold", "sold", "sold", "sold", "sold", "sold", "sold", "sold", "sold", "sold", "sold"]
}
test = pd.DataFrame(data)
I'm evaluating auction sales data and attempting to compute, for each seller, additional variables that summarise their performance in the 5 years prior to a given sale (excluding sales from the same year as the sale).
- If an item sold in 2019, variables returned should be a summary of all sales sold by that seller between 2014 and 2018, excluding 2019.
- If an item sold in 2018 I want variables to relate to sales between 2013 and 2017, excluding 2018.
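In other words, for each reference year the window is the half-open interval [ref_year - 5, ref_year), which can be expressed as a boolean mask (a minimal illustration with made-up years):

```python
import pandas as pd

df = pd.DataFrame({'sale_year': [2019, 2018, 2016, 2014, 2013]})

ref_year = 2019
in_window = (df.sale_year >= ref_year - 5) & (df.sale_year < ref_year)
# keeps 2014-2018 inclusive; 2013 is too old, and 2019 itself is excluded
print(df.loc[in_window, 'sale_year'].tolist())  # [2018, 2016, 2014]
```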
Variables to generate:
- Average of the mean_estimate for the period
- Average of the deviation for the period (deviation being the difference between mean_estimate and sale_price)
- Standard Deviation of the mean_estimate values for the time period
- Standard Deviation of the deviation values for the period
- Number of listings in the period
- Number of listings that sold in the period
- proportion sold ( sales / listings ); its complement, (listings - sales) / listings, is the buy-in rate used in the code below
- natural log of number of listings sold ( ln( 1 + sales ) )
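For reference, the eight variables above are easy to express as one helper over an already-filtered window (a sketch using the column names from the desired output below; `window` is any slice of the listings frame):

```python
import numpy as np
import pandas as pd

def summarise_window(window: pd.DataFrame) -> pd.Series:
    """Summarise one seller's listings for a 5-year window."""
    listings = len(window)
    sales = int((window['status'] == 'sold').sum())
    return pd.Series({
        'avg_est': window['mean_estimate'].mean(),
        'avg_dev': window['deviation'].mean(),
        'sd_est': window['mean_estimate'].std(),   # sample std (ddof=1)
        'sd_dev': window['deviation'].std(),
        'listings': listings,
        'sales': sales,
        'prop_sold': sales / listings if listings else np.nan,
        'ln_sales': np.log1p(sales),               # ln(1 + sales)
    })
```

Feeding it bob's 2013-2017 rows reproduces the bob/2018 line of the desired output (avg_est 47500, sd_est ~38890.9).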
Now I know I can use pandas .groupby() and .agg() to generate grouped stats for individual years,
e.g.
df.groupby(['sale_year', 'seller']).mean_estimate.agg(['mean', 'std'])
and I've made use of smci's helpful advice to use the .rolling()
method like so:
alv = df.groupby(['seller']).rolling(5, min_periods=1).agg({'mean_estimate': ['mean', 'std', 'count'], 'deviation': ['mean', 'std', 'count']})
However this has two problems: it includes rows for sales that happened in the same year (where they should be excluded), and the window covers the 5 most recent transactions rather than the 5 years prior to a sale (there are many sellers in my data with 5000+ sales).
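In case it helps, one way to get a calendar-year window instead of a row-count window is to aggregate to one row per seller and year first, reindex onto a complete run of years so that rolling(5) spans exactly five calendar years, then shift(1) so the sale year itself is excluded. This is a sketch covering only the count/mean variables (the standard deviations would need a sum-of-squares column handled the same way), not tested against the full dataset:

```python
import numpy as np
import pandas as pd

def five_year_prior_stats(df: pd.DataFrame) -> pd.DataFrame:
    """Per seller and reference year: stats over the 5 prior calendar years."""
    out = []
    for seller, grp in df.groupby('seller'):
        # collapse to one row per calendar year with per-year totals
        yearly = grp.groupby('sale_year').agg(
            est_sum=('mean_estimate', 'sum'),
            listings=('mean_estimate', 'count'),
            sales=('status', lambda s: (s == 'sold').sum()),
        )
        # fill gaps so that a 5-row window equals 5 calendar years
        yearly = yearly.reindex(
            range(yearly.index.min(), yearly.index.max() + 1), fill_value=0)
        # shift(1) drops the reference year itself; rolling(5) then covers
        # the half-open interval [ref_year - 5, ref_year)
        win = yearly.shift(1).rolling(5, min_periods=1).sum()
        win['avg_est'] = win['est_sum'] / win['listings']
        win['seller'] = seller
        out.append(win.drop(columns='est_sum'))
    return pd.concat(out).rename_axis('ref_year').reset_index()
```

On the example `test` frame this reproduces the bob/2019 row of the desired output below (3 listings, 2 sales, avg_est ~36666.67), and it stays O(years) per seller rather than O(sales).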
Example of the desired output:
ref_year seller avg_est avg_dev sd_est listings sales prop_sold ln_sales
2019 bob 36666 11333.3 33291.6 3 2 0.66 ln(3)
2019 alice 80000 40000 54543.5 9 9 1 ln(10)
2018 bob 47500 16000 38890.9 2 2 1 ln(3)
2018 alice 80000 40000 54543.5 9 9 1 ln(10)
... ... ... ... ... ... ... ... ...
2010 bob NaN NaN NaN NaN NaN NaN ln(1)
2010 alice 150000 0 NA 1 1 1 ln(2)
Ideally I want to stick to pandas/base Python, as I want to define all of the computations above as a function to save in a library for ease of future ETL.
Thank you in advance for any assistance
- UPDATE
I've not been able to find a solution via pandas' groupby and agg methods, but I have managed to aggregate correctly and perform the computations in a for-loop, as per @pygirl's suggestion:
# target schema for my full dataset (only a subset exists in the toy `test` frame)
cols = ['index', 'sale_year', 'seller', 'realized_price', 'lot_status',
        'mean_estimated_usd', 'deviation', 'mean_estimated_usd_log', 'deviation_rel_log',
        'avgestimate', 'avgdeviation', 'sdestimate',
        'sddeviation', 'numlistings', 'numlistings_sold', 'propbuyin',
        'numlistings_log_prep', 'avgestimate_log_prep', 'cvestimate',
        'cvdeviation']
# aggregate columns to coerce to float afterwards
to_float = ['avg_sale_price', 'avgestimate', 'propbuyin',
            'numlistings_log_prep', 'avgestimate_log_prep']
appended_data = []
for seller in test.seller.unique():
    for year in np.unique(test.sale_year.values):
        # subset data to this seller's listings in the 5 years prior to `year`
        # (note: the filter must use the loop variable `seller`, not test.seller.iloc[0],
        # otherwise every iteration summarises the first seller)
        alv_subset2 = test.loc[(test.seller == seller)
                               & (year - 5 <= test.sale_year)
                               & (test.sale_year < year)]
        # filter down to only relevant columns
        alv_subset2 = alv_subset2[['seller', 'sale_price', 'status',
                                   'mean_estimate', 'deviation']].copy()
        alv_subset2['ref_year'] = year
        alv_subset2 = alv_subset2.reset_index()
        alv_subset2['avg_sale_price'] = np.mean(alv_subset2.sale_price)
        alv_subset2['avgestimate'] = np.mean(alv_subset2.mean_estimate)
        alv_subset2['numlistings'] = len(alv_subset2.status)
        alv_subset2['numlistings_sold'] = (alv_subset2.status == 'sold').sum()
        alv_subset2['propbuyin'] = ((alv_subset2.numlistings - alv_subset2.numlistings_sold)
                                    / alv_subset2.numlistings.astype(float))
        alv_subset2['numlistings_log_prep'] = np.log(1 + alv_subset2.numlistings)
        alv_subset2['avgestimate_log_prep'] = np.log(1 + alv_subset2.avgestimate)
        for col in to_float:  # don't reuse the outer loop variable here
            alv_subset2[col] = alv_subset2[col].astype(float, errors='raise')
        appended_data.append(alv_subset2)
# concatenate once at the end (df.append is deprecated) and drop the raw
# per-listing columns; .drop() returns a new frame, so assign the result back
df_temp = (pd.concat(appended_data, ignore_index=True)
             .drop(['index', 'status', 'realized_price', 'mean_estimated_usd',
                    'deviation', 'mean_estimated_usd_log', 'deviation_rel_log'],
                   axis=1, errors='ignore'))
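Whichever approach ends up in the library, it's worth hand-checking at least one cell against the desired output; e.g. bob's 2019 window (his 2015, 2017 and 2018 rows) should give 3 listings, 2 sold and an average estimate of 110000/3 ≈ 36666.67:

```python
import pandas as pd

# bob's rows from the example data
bob = pd.DataFrame({
    'sale_year': [2019, 2018, 2017, 2015],
    'mean_estimate': [20000, 15000, 20000, 75000],
    'status': ['sold', 'not sold', 'sold', 'sold'],
})

ref_year = 2019
window = bob[(bob.sale_year >= ref_year - 5) & (bob.sale_year < ref_year)]
print(len(window))                             # 3 listings
print(int((window.status == 'sold').sum()))    # 2 sold
print(round(window.mean_estimate.mean(), 2))   # 36666.67
```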