How to get the last date in custom interval? - Pandas

Question

May my example be a lot big,my code is here:

import pandas as pd
import numpy as np
import io
t = """
name     date
a     2005-08-31
a     2005-09-20
a     2005-11-12
a     2005-12-31
a     2006-03-31
a     2006-06-25
a     2006-07-23
a     2006-09-28
a     2006-12-21
a     2006-12-27
a     2007-07-23
a     2007-09-21
a     2007-03-15
a     2008-04-12
a     2008-06-21
a     2008-06-11
b     2005-08-31
b     2005-09-23
b     2005-11-12
b     2005-12-31
b     2006-03-31
b     2006-06-25
b     2006-07-23
b     2006-09-28
b     2006-12-21
b     2006-12-27
b     2007-07-23
b     2007-09-21
b     2007-03-15
b     2008-04-12
b     2008-06-21
b     2008-06-11
"""
data=pd.read_csv(io.StringIO(t),delimiter='     ')#5 space here
data

What I want to do is to find the every last day of the year which year begin2005-7-1) and end 2006-06-30 ,begin 2006-7-1 and end 2007-6-30...and so on . And my expected output is here:

name     date
a     2006-06-25  #the last day of the 2005/7/01 -2006/06/31
a     2007-03-15  #the last day of the 2006/7/01 -2007/06/31
a     2008-06-21  #the last day of the 2007/7/01 -2008/06/31
b     2006-06-25  #the last day of the 2005/7/01 -2006/06/31
b     2007-03-15  #the last day of the 2006/7/01 -2007/06/31
b     2008-06-21  #the last day of the 2007/7/01 -2008/06/31

How can fix this ? I think I should use the custom

Note: September only has 30 days, so there is some bad data in there. — Andy Hayden, Oct 21 '17 at 04:17

score 5 · Accepted Answer · answered Oct 21 '17 at 05:00

5

You can do this without the rollback using a single groupby:

In [11]: data.date = pd.to_datetime(data.date, format="%Y-%m-%d")

In [12]: df.groupby(["name", pd.Grouper(key="date", freq="AS-JUL")])["date"].max()
Out[12]:
name  date
a     2005-07-01   2006-06-25
      2006-07-01   2007-03-15
      2007-07-01   2008-06-21
b     2005-07-01   2006-06-25
      2006-07-01   2007-03-15
      2007-07-01   2008-06-21
Name: date, dtype: datetime64[ns]

answered Oct 21 '17 at 05:00

Andy Hayden

291,328
80
565
500

I was using the freq and grouper the whole time din't know it can take offsets too. Super +1. I will remember this one – Bharath Oct 21 '17 at 05:00
Adding as another answer, as it's pretty different (a lot shorter)! – Andy Hayden Oct 21 '17 at 05:01
I didn't know offset aliases. Now I got a stronger grip towards them. All i was doing was 3D , 5D .... – Bharath Oct 21 '17 at 05:16
1

@Bharathshetty you should very familiar with the method I use ~ :-) – BENY Oct 21 '17 at 05:23
Never mean to bother,But if I want the`[2015-7-02,2016-6-25]` as a custom year, how can I do this? – ileadall42 Oct 22 '17 at 05:35
@Tangfeifan interesting, please ask a new question :) I am not sure at the moment, but if you ask as a new question I think someone will be able to answer :) (Just the "[2015-7-02,2016-6-25] as a custom year" part) – Andy Hayden Oct 22 '17 at 05:38
OK,I have been fix it ,just set the `freq=365` and specify the start and end ,is OK!Thanks for your help – ileadall42 Oct 22 '17 at 05:45
@Tangfeifan no problem, feel free to link to a new question here if you make a new question! – Andy Hayden Oct 22 '17 at 05:49

Andy Hayden · Answer 2 · 2017-10-21T04:52:59.247

4

Well, this seems like a magical way!
The frequency is "AS-JUL" (which is year start frequency, starting in July).

We'll first take the beginning of each month (since you have some bad dates in there, let's just ignore them) but the critical thing is we need it to be datetime rather than string:

In [11]: pd.to_datetime(data.date.str[:7], format="%Y-%m")  # to beginning of month
Out[11]:
0    2005-08-01
1    2005-09-01
2    2005-11-01
3    2005-12-01
...

In [12]: df.date = pd.to_datetime(data.date.str[:7], format="%Y-%m")

Now here comes the magic:

In [13]: from pandas.tseries.frequencies import to_offset

In [14]: df.date.map(to_offset("AS-JUL").rollback)
Out[14]:
0    2005-07-01
1    2005-07-01
2    2005-07-01
3    2005-07-01
4    2005-07-01
5    2005-07-01
6    2006-07-01
7    2006-07-01
8    2006-07-01
9    2006-07-01
10   2007-07-01
11   2007-07-01
12   2006-07-01
13   2007-07-01
14   2007-07-01
15   2007-07-01
16   2005-07-01
17   2005-07-01
18   2005-07-01
19   2005-07-01
20   2005-07-01
21   2005-07-01
22   2006-07-01
23   2006-07-01
24   2006-07-01
25   2006-07-01
26   2007-07-01
27   2007-07-01
28   2006-07-01
29   2007-07-01
30   2007-07-01
31   2007-07-01
Name: date, dtype: datetime64[ns]

We created an offset to "AS-JUL" and rolled it back (meaning floor).
Note: For whatever reason we can't use dt.floor...

Okay, misread this part, you want the latest recorded date for each group in each period, with the corrected dates, the last part is just a groupby:

In [21]: data.date = pd.to_datetime(data.date, format="%Y-%m-%d")

In [22]: data["period_start"] = data.date.map(to_offset("AS-JUL").rollback).dt.normalize()

In [23]: data.groupby(["name", "period_start"])["date"].max()
Out[23]:
name  period_start
a     2005-07-01     2006-06-25
      2006-07-01     2007-03-15
      2007-07-01     2008-06-21
b     2005-07-01     2006-06-25
      2006-07-01     2007-03-15
      2007-07-01     2008-06-21
Name: date, dtype: datetime64[ns]

edited Oct 21 '17 at 04:52

answered Oct 21 '17 at 04:38

Andy Hayden

291,328
80
565
500

Well these are really new for me – Bharath Oct 21 '17 at 04:40
@Bharathshetty new to me also, very neat trick (the "magic" is from the [linked answer](https://stackoverflow.com/a/45963946/1240268)). For the other offsets see: https://stackoverflow.com/a/35339226/1240268. – Andy Hayden Oct 21 '17 at 04:41
I loved the solution but the answer OP is looking for I think is different. He want to find last working day in the interval I think. We need groupby then check for the max date by creating an interval index I think – Bharath Oct 21 '17 at 04:43
@Bharathshetty oh no!! – Andy Hayden Oct 21 '17 at 04:44
1

@Bharathshetty I will modify, the trick gets you most of the way there. – Andy Hayden Oct 21 '17 at 04:45
Yeah@Bharathshetty is right! I need to find the last day in the intervalu ! – ileadall42 Oct 21 '17 at 04:51
Even I came up with almost same thing I think – Bharath Oct 21 '17 at 04:54
1

@Bharathshetty there's a better trick, see my other answer :) – Andy Hayden Oct 21 '17 at 05:00
@AndyHayden [Here is the new](https://stackoverflow.com/questions/46871173/how-can-i-astype-the-freq-as-what-i-want) – ileadall42 Oct 22 '17 at 06:34

Bharath · Answer 3 · 2017-10-21T05:04:50.593

3

From the beautiful function to_offset @Andy suggested we can do

from pandas.tseries.frequencies import to_offset
new = data.groupby('name').apply(lambda x : x.groupby(x['date'].map(to_offset("AS-JUL"))).max())

             name       date
name date                      
a    2006-07-01    a 2006-06-25
     2007-07-01    a 2007-03-15
     2008-07-01    a 2008-06-21
b    2006-07-01    b 2006-06-25
     2007-07-01    b 2007-03-15
     2008-07-01    b 2008-06-21

edited Oct 21 '17 at 05:04

answered Oct 21 '17 at 04:53

Bharath

26,906
5
45
85

1

You both do great but I just can accept one ,so I will accept the faster one Andy,but also thanks you very much! – ileadall42 Oct 21 '17 at 04:55
Even I will feel bad if you dont accept his solution. – Bharath Oct 21 '17 at 04:56

score 3 · Answer 4 · answered Oct 21 '17 at 05:22

3

By using IntervalIndex(DF is your DataFrame)

idx=pd.IntervalIndex.from_arrays(pd.date_range(start='2005-07-01',freq='12MS',periods=12),pd.date_range(start='2006-06-30',freq='12M',periods=12),closed='both')
df=pd.DataFrame({'G':list(range(len(idx)))},index=idx)
DF.date=pd.to_datetime(DF.date)
DF['G']=df.loc[DF.date].values
DF.sort_values(['name','date']).drop_duplicates(['name','G'],keep='last')

Out[19]: 
   name       date  G
5     a 2006-06-25  0
12    a 2007-03-15  1
14    a 2008-06-21  2
21    b 2006-06-25  0
28    b 2007-03-15  1
30    b 2008-06-21  2

answered Oct 21 '17 at 05:22

BENY

258,262
17
121
165

I have to say this is very easy to understand!Thanks to you. – ileadall42 Oct 21 '17 at 05:34
I do you know this is my first approach I used `beg = np.array([['{}/7/01'.format(i),'{}/6/30'.format(i+1)] for i in range(2005,2010)])` `index = pd.IntervalIndex.from_arrays(pd.to_datetime(beg[:,0]),pd.to_datetime(beg[:,1]))`. Yours is better – Bharath Oct 21 '17 at 05:35
@Tangfeifan Yw~ :-) – BENY Oct 21 '17 at 05:37
I knew grouper but not offset aliases. Its new to me and is fantastic – Bharath Oct 21 '17 at 05:38

How to get the last date in custom interval? - Pandas

4 Answers4