11

I would like to compute the 1-year rolling average for each row in this Dataframe test:

index   id      date        variation
2313    7034    2018-03-14  4.139148e-06
2314    7034    2018-03-13  4.953194e-07
2315    7034    2018-03-12  2.854749e-06
2316    7034    2018-03-09  3.907458e-06
2317    7034    2018-03-08  1.662412e-06
2318    7034    2018-03-07  1.346433e-06
2319    7034    2018-03-06  8.731700e-06
2320    7034    2018-03-05  7.145597e-06
2321    7034    2018-03-02  4.893283e-06
...

For example, I would need to calculate:

  • mean of variation of id 7034 between 2018-03-14 and 2017-08-14
  • mean of variation of id 7034 between 2018-03-13 and 2017-08-13
  • etc.

I tried:

test.groupby(['id','date'])['variation'].rolling(window=1,freq='Y',on='date').mean()

but I got the error message:

ValueError: invalid on specified as date, must be a column (if DataFrame) or None

How can I use the pandas rolling() function in this case?


[EDIT 1] [thanks to Sacul]

I tested:

df['date'] = pd.to_datetime(df['date'])

df.set_index('date').groupby('id').rolling(window=1, freq='Y').mean()['variation']

But freq='Y' doesn't work (I got: ValueError: Invalid frequency: Y) Then I used window = 365, freq = 'D'.

But there is another issue: because there are never 365 consecutive dates for each combined id-date, the result is always empty. Even if there missing dates, I would like to ignore them and consider all dates between the current date and the (current date - 365) to compute the rolling mean. For instance, imagine I have:

index   id      date        variation
2313    7034    2018-03-14  4.139148e-06
2314    7034    2018-03-13  4.953194e-07
2315    7034    2017-03-13  2.854749e-06

Then,

  • for 7034 2018-03-14: I would like to compute MEAN(4.139148e-06,4.953194e-07, 2.854749e-06)
  • for 7034 2018-03-13: I would like to compute also MEAN(4.139148e-06,4.953194e-07, 2.854749e-06)

How can I do that?


[EDIT 2]

Finally I used the formula below to calculate rolling median, averages and standard deviation on 1 Year by ignoring missing values:

pd.rolling_median(df.set_index('date').groupby('id')['variation'],window=365, freq='D',min_periods=1)

pd.rolling_mean(df.set_index('date').groupby('id')['variation'],window=365, freq='D',min_periods=1)

pd.rolling_std(df.set_index('date').groupby('id')['variation'],window=365, freq='D',min_periods=1)
smci
  • 26,085
  • 16
  • 96
  • 138
Thomas
  • 111
  • 1
  • 5
  • 6
    Well formatted, shows an attempt, explicitly gives the error. Welcome to stackoverflow! – Drise Mar 20 '18 at 15:24

1 Answers1

4

I believe this should work for you:

# First make sure that `date` is a datetime object:

df['date'] = pd.to_datetime(df['date'])

df.set_index('date').groupby('id').rolling(window=1, freq='A').mean()['variation']

using pd.DataFrame.rolling with datetime works well when the date is the index, which is why I used df.set_index('date') (as can be seen in one of the documentation's examples)

I can't really test if it works on the year's average on your example dataframe, as there is only one year and only one ID, but it should work.

Arguably Better Solution:

[EDIT] As pointed out by Mihai-Andrei Dinculescu, freq is now a deprecated argument. Here is an alternative (and probably more future-proof) way to do what you're looking for:

df.set_index('date').groupby('id')['variation'].resample('A').mean()

You can take a look at the resample documentation for more details on how this works, and this link regarding the frequency arguments.

Community
  • 1
  • 1
sacuL
  • 42,057
  • 8
  • 58
  • 83
  • 1
    `ValueError: Invalid frequency: Y` – Mihai Dinculescu Mar 20 '18 at 15:47
  • 1
    It might be related to the fact that `freq` is deprecated. https://pandas.pydata.org/pandas-docs/stable/generated/pandas.DataFrame.rolling.html – Mihai Dinculescu Mar 20 '18 at 15:52
  • It's strange, it works for me. based on [this link](http://pandas.pydata.org/pandas-docs/stable/timeseries.html#offset-aliases), does it work when you try using frequency `'A'`? Should do the same thing. – sacuL Mar 21 '18 at 11:58
  • `A` works fine. What version of pandas are you using? – Mihai Dinculescu Mar 21 '18 at 12:48
  • `0.21.1`. Glad it works. It seems maybe in older versions, `'A'` is the only accepted alias for year end frequency ([as seen in this answer](https://stackoverflow.com/a/35339226/6671176)), but the current [documentation](http://pandas.pydata.org/pandas-docs/stable/timeseries.html#offset-aliases) includes `'A'` and `'Y'` as aliases. – sacuL Mar 21 '18 at 16:15
  • I'm on version 0.22.0 – Mihai Dinculescu Mar 21 '18 at 20:43
  • Then I'm at a loss :\ – sacuL Mar 22 '18 at 01:22