
Really struggling with this rolling computation... Any assistance much appreciated.

For dataframe:

    sale_year   seller  item_id  mean_estimate  sale_price  deviation   status
0   2019        bob     1        20000          11000       -9000       sold
1   2019        alice   2        35000          39000       4000        sold
2   2018        bob     3        15000          17000       2000        not sold
3   2017        alice   4        60000          120000      60000       sold
4   2017        alice   5        50000          80000       30000       sold
5   2017        alice   6        60000          120000      60000       sold
6   2017        alice   7        40000          120000      80000       sold
7   2017        bob     8        20000          27000       7000        sold
8   2017        alice   9        200000         175000      -25000      sold
9   2016        alice   10       100000         150000      50000       sold
10  2015        bob     11       75000          100000      25000       sold
11  2015        alice   12       100000         150000      50000       sold      
12  2015        alice   13       10000          15000       5000        sold
13  2015        alice   14       100000         150000      50000       sold
14  2009        alice   15       150000         150000      0           sold

code:

data = {
"sale_year": [2019, 2019, 2018, 2017, 2017, 2017, 2017, 2017, 2017, 2016, 2015, 2015, 2015, 2015, 2009],
"seller": ["bob", "alice", "bob", "alice", "alice", "alice", "alice", "bob", "alice", "alice", "bob", "alice", "alice", "alice", "alice"],
"item_id": [1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15],
"mean_estimate": [20000, 35000, 15000, 60000, 50000, 60000, 40000, 20000, 200000, 100000, 75000, 100000, 10000, 100000, 150000],
"sale_price": [11000, 39000, 17000, 120000, 80000, 120000, 120000, 27000, 175000, 150000, 100000, 150000, 15000, 150000, 150000],
"deviation": [-9000, 4000, 2000, 60000, 30000, 60000, 80000, 7000, -25000, 50000, 25000, 50000, 5000, 50000, 0],
"status": ["sold", "sold", "not sold", "sold", "sold", "sold", "sold", "sold", "sold", "sold", "sold", "sold", "sold", "sold", "sold"]
}
test = pd.DataFrame(data)

I'm evaluating auction sales data, and I'm attempting to compute additional variables for each seller that summarise their performance in the 5 years prior to a given sale (excluding sales from the same year as the sale).

  • If an item sold in 2019, the variables returned should summarise all of that seller's sales between 2014 and 2018, excluding 2019.
  • If an item sold in 2018, the variables should relate to sales between 2013 and 2017, excluding 2018. (A boolean-mask sketch of this window follows.)
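
In boolean-mask form, the window for a given reference year would look like this (a sketch only; ref_year is an illustrative name):

ref_year = 2019  # e.g. a sale from 2019
window = test[(test.sale_year >= ref_year - 5) & (test.sale_year < ref_year)]  # 2014-2018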

Variables to generate:

  1. Average of the mean_estimate for the period
  2. Average of the deviation for the period (deviation being sale_price minus mean_estimate)
  3. Standard Deviation of the mean_estimate values for the time period
  4. Standard Deviation of the deviation values for the period
  5. Number of listings in the period
  6. Number of listings that sold in the period
  7. Proportion sold ( sales / listings )
  8. Natural log of the number of listings sold ( ln( 1 + sales ) ) (a sketch of all eight computations follows this list)
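
For reference, a minimal sketch of these eight computations as a single function over an already-filtered window subset (window_stats and the output names are illustrative, not an established API):

import numpy as np
import pandas as pd

def window_stats(window):
    # summarise one seller's pre-filtered 5-year window (sketch)
    listings = len(window)                       # 5. number of listings
    sales = (window.status == 'sold').sum()      # 6. number of listings sold
    return pd.Series({
        'avg_est': window.mean_estimate.mean(),  # 1. average estimate
        'avg_dev': window.deviation.mean(),      # 2. average deviation
        'sd_est': window.mean_estimate.std(),    # 3. std of estimates
        'sd_dev': window.deviation.std(),        # 4. std of deviations
        'listings': listings,
        'sales': sales,
        'prop_sold': sales / listings if listings else np.nan,  # 7.
        'ln_sales': np.log1p(sales),             # 8. ln(1 + sales)
    })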

Now I know I can use pandas .groupby() and agg functions to generate grouped stats for individual years, e.g.:

df.groupby([df.sale_year, df.seller]).mean_estimate.agg([np.mean, np.std])

and I've made use of smci's helpful advice to use the .rolling() method like so:

alv = df.groupby(['seller']).rolling(5, min_periods=1).agg(
    {'mean_estimate': ['mean', 'std', 'count'],
     'deviation': ['mean', 'std', 'count']})

However, the issue with this is that it includes rows for sales that happened in the same year (where they should be excluded), and the time period I want is the 5 years prior to a sale, not the 5 most recent transactions (there are many sellers in my data with 5000+ sales).

Example output:

ref_year    seller  avg_est  avg_dev  sd_est   listings  sales  prop_sold  ln_sales
2019        bob     36666    11333.3  33291.6  3         2      0.66       ln(3)
2019        alice   80000    40000    54543.5  9         9      1          ln(10)
2018        bob     47500    16000    38890.9  2         2      1          ln(3)
2018        alice   80000    40000    54543.5  9         9      1          ln(10)
...         ...     ...      ...      ...      ...       ...    ...        ...
2010        bob     NaN      NaN      NaN      0         0      NaN        ln(1)
2010        alice   150000   0        NaN      1         1      1          ln(2)

Ideally I want to stick to pandas/base Python, as I want to define all of the computations above as a function (like the window_stats sketch above) to save in a library for ease of future ETL.

Thank you in advance for any assistance

  • UPDATE

I've not been able to find a solution via the pandas groupby/agg methods, but I have managed to aggregate correctly and perform the computations in a for-loop as per @Pygirl's suggestion:

appended_data = []

for seller in test.seller.unique():
    ref_years = np.unique(test.sale_year.values)
    for year in ref_years:
        # subset data to listings by this seller in the 5 years prior to the reference year
        alv_subset2 = test.loc[(test.seller == seller)
                               & (year - 5 <= test.sale_year)
                               & (test.sale_year < year)]
        # keep only the relevant columns (.copy() avoids SettingWithCopyWarning)
        alv_subset2 = alv_subset2[['seller', 'sale_price', 'status',
                                   'mean_estimate', 'deviation']].copy()
        alv_subset2['ref_year'] = year
        alv_subset2 = alv_subset2.reset_index(drop=True)

        alv_subset2['avg_sale_price'] = np.mean(alv_subset2.sale_price)
        alv_subset2['avgestimate'] = np.mean(alv_subset2.mean_estimate)
        alv_subset2['numlistings'] = len(alv_subset2.status)
        alv_subset2['numlistings_sold'] = len(alv_subset2[alv_subset2['status'] == 'sold'])
        alv_subset2['propbuyin'] = ((alv_subset2.numlistings - alv_subset2.numlistings_sold)
                                    / alv_subset2.numlistings.astype(float))
        alv_subset2['numlistings_log_prep'] = np.log(1 + alv_subset2.numlistings)
        alv_subset2['avgestimate_log_prep'] = np.log(1 + alv_subset2.avgestimate)

        appended_data.append(alv_subset2)

df_temp = pd.concat(appended_data).drop(columns=['status'])



72usty
  • This is called a ***rolling window***. In particular this one would be a *rolling 5-year window (backward-looking)* – smci Jan 19 '21 at 16:20
  • Does this answer your question? [1 Year Rolling mean pandas on column date](https://stackoverflow.com/questions/49388065/1-year-rolling-mean-pandas-on-column-date) – smci Jan 19 '21 at 16:33
  • Having looked at the rolling method, I'm not sure that it does. That question appears to be explicitly dealing with a datetime column and specifying 'Y'/'D' as the window freq. I'm dealing with an int column for my year, and I also want to exclude the actual reference year for the row; I'm looking for the prior 5 years from the reference year, e.g. 2019 should summarise 2014-2018. – 72usty Jan 19 '21 at 16:48
  • 72usty: there are [582(!) Q&A for \[pandas\] rolling groupby](https://stackoverflow.com/search?q=%5Bpandas%5D+rolling+groupby), and there are plenty more on other keywords (e.g. "year window", "year interval"). The onus is on you to look through them. If any of you find a better duplicate target, please cite it here and vote-to-close. – smci Jan 19 '21 at 16:51
  • I've now looked through dozens of previous questions, but I'm yet to find one tackling a similar issue as in my case above. I know I can perform what I'm after in SQL with an inefficient cross join via a where clause t2.sale_year <= t1.sale_year - 1 AND t2.sale_year >= t1.sale_year - 5, but I'm trying my best to avoid SQL and to find an efficient pythonic method instead – 72usty Jan 20 '21 at 22:11
  • 72usty: Yeah on closer inspection your case is actually quite rare. I retracted my close-vote. I edited your title and body to explicitly say *"excluding current year"* - sorry I hadn't picked up on that from your original statement. I'll update my answer. – smci Jan 21 '21 at 02:12
  • @smci Thank you for your continued assistance. It is much appreciated – 72usty Jan 21 '21 at 10:22

3 Answers


You want to group by seller and then aggregate over a 5-year rolling window, using pandas rolling.

a) First, make sale_year the index, so that rolling() can work directly on the index; also it's helpful to sort the df by ascending (not descending) date:

df = df.set_index('sale_year').sort_index()

b) Next, you can use df.rolling(window=5, min_periods=1). Note that an integer window counts observations, not index units, so this covers the 5 most recent rows and only coincides with a 5-year window when each seller has exactly one row per year; the min_periods=1 argument prevents spurious annoying NaNs in the result.

c) Then, compute summary statistics with agg(), passing it a dict of 'column_name': aggregation. Each aggregation can be a function (e.g. pd.Series.mean), a string name (e.g. 'mean'), or your own custom or lambda function:

df.groupby('seller').rolling(5, min_periods=1).agg({'mean_estimate': 'mean'})

                  mean_estimate
seller sale_year               
alice  2017        60000.000000
       2017       130000.000000
       2019        98333.333333
bob    2015        75000.000000
       2016        87500.000000
       2017        65000.000000
       2018        52500.000000

d) Then add all your other 'field': 'function' entries into the aggregate dict. Note that RollingGroupby doesn't have a size method (unlike base GroupBy) and in general supports far fewer aggregation methods, so you'll have to code some of these yourself, as in the sketch below.
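
For example, a sketch of a fuller aggregate dict, using 'count' on item_id as a stand-in for the missing size (the exact column list is illustrative):

df.groupby('seller').rolling(5, min_periods=1).agg(
    {'mean_estimate': ['mean', 'std'],
     'deviation': ['mean', 'std'],
     'item_id': 'count'})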

smci
  • Thank you for returning with this answer. I've tried to integrate this into my code but there are several issues still present with the solution. In the case of sellers who have over 5 sales in a given year (there are sellers with 1000s of sales in a year) the calculations apply to only the 5 most recent transactions as opposed to all transactions over a 5 year period. Additionally, the calculations include transactions in the same year as opposed to the 5 years prior to the transaction. – 72usty Jan 20 '21 at 18:37
  • 72usty: a) ok please edit your question to show a better reproducible data example, showing sellers with >5 transactions b) if you meant *"1-5 years prior but excluding the same year"*, then that wasn't clearly stated; please edit the question wording. – smci Jan 20 '21 at 21:03
  • I've improved the data example and edited wording to be additionally clear. Thanks – 72usty Jan 20 '21 at 22:05
  • I would also like to know how to exclude the current year from this rolling-based solution. +1 – Pygirl Jan 22 '21 at 19:18
  • @Pygirl I don't think this question will be answered now. I suspect the only way to do this is going to be to brute force it with for loops. – 72usty Jan 24 '21 at 20:41
  • @Pygirl, 72usty: it will be with a logical expression to filter years 1-5 prior. (But not for-loops. Don't say for-loops.) – smci Jan 25 '21 at 00:32
  • @smci I'm not sure I understand what you're referring to. I have since updated the question to include the for-loop I've been working on. It works, to an extent, but there is now the challenge of aggregating the data correctly and selecting the correct rows from the resultant dataframe. – 72usty Jan 25 '21 at 01:05

You can try something like this:

def grpdates(grp):
    row_sale_year = grp.sale_year.values
    row_seller = grp.seller.iloc[0]
    print(row_seller, row_sale_year)
    for year in row_sale_year:
        df2 = (df1.loc[(df1.seller==row_seller) & (year-5<=df1.sale_year) & (df1.sale_year<=year-1)])
        ### apply your aggregation for each seller and their corresponding year
        ### and concatenate the results with pd.concat(<>)
        
        print(df2)

df1 = df.loc[df.status=='sold']
df1.groupby('seller').apply(grpdates)
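
For instance, to collect results instead of printing them, each window could be reduced to one summary row gathered in a list (a sketch; summaries and the chosen stats are illustrative additions, not part of the answer above):

summaries = []

def grpdates(grp):
    row_seller = grp.seller.iloc[0]
    for year in np.unique(grp.sale_year.values):
        df2 = df1.loc[(df1.seller == row_seller)
                      & (year - 5 <= df1.sale_year) & (df1.sale_year <= year - 1)]
        # one summary row per (seller, reference year)
        summaries.append({'seller': row_seller, 'ref_year': year,
                          'avg_est': df2.mean_estimate.mean(),
                          'listings': len(df2)})

df1.groupby('seller').apply(grpdates)
result = pd.DataFrame(summaries)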
Pygirl

I've not been able to find a solution via the pandas groupby/agg methods, but I have managed to aggregate correctly and perform the computations in a for-loop as per @Pygirl's suggestion.

Any more pythonic edits to the code are welcomed.

Thank you.

appended_data = []

for seller in test.seller.unique():
    ref_years = np.unique(test.sale_year.values)
    for year in ref_years:
        # subset data to listings by this seller in the 5 years prior to the reference year
        alv_subset2 = test.loc[(test.seller == seller)
                               & (year - 5 <= test.sale_year)
                               & (test.sale_year < year)]
        # keep only the relevant columns (.copy() avoids SettingWithCopyWarning)
        alv_subset2 = alv_subset2[['seller', 'sale_price', 'status',
                                   'mean_estimate', 'deviation']].copy()
        alv_subset2['ref_year'] = year
        alv_subset2 = alv_subset2.reset_index(drop=True)

        alv_subset2['avg_sale_price'] = np.mean(alv_subset2.sale_price)
        alv_subset2['avgestimate'] = np.mean(alv_subset2.mean_estimate)
        alv_subset2['numlistings'] = len(alv_subset2.status)
        alv_subset2['numlistings_sold'] = len(alv_subset2[alv_subset2['status'] == 'sold'])
        alv_subset2['propbuyin'] = ((alv_subset2.numlistings - alv_subset2.numlistings_sold)
                                    / alv_subset2.numlistings.astype(float))
        alv_subset2['numlistings_log_prep'] = np.log(1 + alv_subset2.numlistings)
        alv_subset2['avgestimate_log_prep'] = np.log(1 + alv_subset2.avgestimate)

        appended_data.append(alv_subset2)

df_temp = pd.concat(appended_data).drop(columns=['status'])
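
Since more pythonic edits are welcomed, here is one possible vectorised alternative (a sketch only, under the toy column names above): pre-aggregate per (seller, year), take a rolling 5-year sum, then shift the result down one year so the reference year itself is excluded. The standard deviations are reconstructed from sums and sums of squares; five_year_window and the *_sum names are illustrative, not an established recipe.

yearly = (test.assign(sold=test.status.eq('sold'),
                      est_sq=test.mean_estimate ** 2,
                      dev_sq=test.deviation ** 2)
              .groupby(['seller', 'sale_year'])
              .agg(est_sum=('mean_estimate', 'sum'),
                   est_sq_sum=('est_sq', 'sum'),
                   dev_sum=('deviation', 'sum'),
                   dev_sq_sum=('dev_sq', 'sum'),
                   listings=('item_id', 'count'),
                   sales=('sold', 'sum')))

def five_year_window(g):
    # reindex to consecutive years so rolling(5) really spans 5 calendar years,
    # then shift(1) so the reference year itself is excluded from its own window
    g = g.reindex(range(g.index.min(), g.index.max() + 1), fill_value=0)
    return g.rolling(5, min_periods=1).sum().shift(1)

win = yearly.groupby(level='seller').apply(
    lambda g: five_year_window(g.droplevel('seller')))

win['avgestimate'] = win.est_sum / win.listings
win['avgdeviation'] = win.dev_sum / win.listings
# sample standard deviation recovered from the sum and the sum of squares
win['sdestimate'] = np.sqrt((win.est_sq_sum - win.est_sum ** 2 / win.listings)
                            / (win.listings - 1))
win['sddeviation'] = np.sqrt((win.dev_sq_sum - win.dev_sum ** 2 / win.listings)
                             / (win.listings - 1))
win['propbuyin'] = (win.listings - win.sales) / win.listings
win['ln_sales'] = np.log1p(win.sales)

Each (seller, year) row of win then summarises the 5 calendar years before that year; note that windows with zero listings come out as NaN/inf after the divisions and would need filtering.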



72usty