2

May my example be a lot big,my code is here:

import pandas as pd
import numpy as np
import io
t = """
name     date
a     2005-08-31
a     2005-09-20
a     2005-11-12
a     2005-12-31
a     2006-03-31
a     2006-06-25
a     2006-07-23
a     2006-09-28
a     2006-12-21
a     2006-12-27
a     2007-07-23
a     2007-09-21
a     2007-03-15
a     2008-04-12
a     2008-06-21
a     2008-06-11
b     2005-08-31
b     2005-09-23
b     2005-11-12
b     2005-12-31
b     2006-03-31
b     2006-06-25
b     2006-07-23
b     2006-09-28
b     2006-12-21
b     2006-12-27
b     2007-07-23
b     2007-09-21
b     2007-03-15
b     2008-04-12
b     2008-06-21
b     2008-06-11
"""
data=pd.read_csv(io.StringIO(t),delimiter='     ')#5 space here
data

What I want to do is to find the every last day of the year which year begin2005-7-1) and end 2006-06-30 ,begin 2006-7-1 and end 2007-6-30...and so on . And my expected output is here:

name     date
a     2006-06-25  #the last day of the 2005/7/01 -2006/06/31
a     2007-03-15  #the last day of the 2006/7/01 -2007/06/31
a     2008-06-21  #the last day of the 2007/7/01 -2008/06/31
b     2006-06-25  #the last day of the 2005/7/01 -2006/06/31
b     2007-03-15  #the last day of the 2006/7/01 -2007/06/31
b     2008-06-21  #the last day of the 2007/7/01 -2008/06/31

How can fix this ? I think I should use the custom

Bharath
  • 26,906
  • 5
  • 45
  • 85
ileadall42
  • 571
  • 2
  • 6
  • 19

4 Answers4

5

You can do this without the rollback using a single groupby:

In [11]: data.date = pd.to_datetime(data.date, format="%Y-%m-%d")

In [12]: df.groupby(["name", pd.Grouper(key="date", freq="AS-JUL")])["date"].max()
Out[12]:
name  date
a     2005-07-01   2006-06-25
      2006-07-01   2007-03-15
      2007-07-01   2008-06-21
b     2005-07-01   2006-06-25
      2006-07-01   2007-03-15
      2007-07-01   2008-06-21
Name: date, dtype: datetime64[ns]
Andy Hayden
  • 291,328
  • 80
  • 565
  • 500
  • I was using the freq and grouper the whole time din't know it can take offsets too. Super +1. I will remember this one – Bharath Oct 21 '17 at 05:00
  • Adding as another answer, as it's pretty different (a lot shorter)! – Andy Hayden Oct 21 '17 at 05:01
  • I didn't know offset aliases. Now I got a stronger grip towards them. All i was doing was 3D , 5D .... – Bharath Oct 21 '17 at 05:16
  • 1
    @Bharathshetty you should very familiar with the method I use ~ :-) – BENY Oct 21 '17 at 05:23
  • Never mean to bother,But if I want the`[2015-7-02,2016-6-25]` as a custom year, how can I do this? – ileadall42 Oct 22 '17 at 05:35
  • @Tangfeifan interesting, please ask a new question :) I am not sure at the moment, but if you ask as a new question I think someone will be able to answer :) (Just the "[2015-7-02,2016-6-25] as a custom year" part) – Andy Hayden Oct 22 '17 at 05:38
  • OK,I have been fix it ,just set the `freq=365` and specify the start and end ,is OK!Thanks for your help – ileadall42 Oct 22 '17 at 05:45
  • @Tangfeifan no problem, feel free to link to a new question here if you make a new question! – Andy Hayden Oct 22 '17 at 05:49
4

Well, this seems like a magical way!
The frequency is "AS-JUL" (which is year start frequency, starting in July).

We'll first take the beginning of each month (since you have some bad dates in there, let's just ignore them) but the critical thing is we need it to be datetime rather than string:

In [11]: pd.to_datetime(data.date.str[:7], format="%Y-%m")  # to beginning of month
Out[11]:
0    2005-08-01
1    2005-09-01
2    2005-11-01
3    2005-12-01
...

In [12]: df.date = pd.to_datetime(data.date.str[:7], format="%Y-%m")

Now here comes the magic:

In [13]: from pandas.tseries.frequencies import to_offset

In [14]: df.date.map(to_offset("AS-JUL").rollback)
Out[14]:
0    2005-07-01
1    2005-07-01
2    2005-07-01
3    2005-07-01
4    2005-07-01
5    2005-07-01
6    2006-07-01
7    2006-07-01
8    2006-07-01
9    2006-07-01
10   2007-07-01
11   2007-07-01
12   2006-07-01
13   2007-07-01
14   2007-07-01
15   2007-07-01
16   2005-07-01
17   2005-07-01
18   2005-07-01
19   2005-07-01
20   2005-07-01
21   2005-07-01
22   2006-07-01
23   2006-07-01
24   2006-07-01
25   2006-07-01
26   2007-07-01
27   2007-07-01
28   2006-07-01
29   2007-07-01
30   2007-07-01
31   2007-07-01
Name: date, dtype: datetime64[ns]

We created an offset to "AS-JUL" and rolled it back (meaning floor).
Note: For whatever reason we can't use dt.floor...


Okay, misread this part, you want the latest recorded date for each group in each period, with the corrected dates, the last part is just a groupby:

In [21]: data.date = pd.to_datetime(data.date, format="%Y-%m-%d")

In [22]: data["period_start"] = data.date.map(to_offset("AS-JUL").rollback).dt.normalize()

In [23]: data.groupby(["name", "period_start"])["date"].max()
Out[23]:
name  period_start
a     2005-07-01     2006-06-25
      2006-07-01     2007-03-15
      2007-07-01     2008-06-21
b     2005-07-01     2006-06-25
      2006-07-01     2007-03-15
      2007-07-01     2008-06-21
Name: date, dtype: datetime64[ns]
Andy Hayden
  • 291,328
  • 80
  • 565
  • 500
3

From the beautiful function to_offset @Andy suggested we can do

from pandas.tseries.frequencies import to_offset
new = data.groupby('name').apply(lambda x : x.groupby(x['date'].map(to_offset("AS-JUL"))).max())
             name       date
name date                      
a    2006-07-01    a 2006-06-25
     2007-07-01    a 2007-03-15
     2008-07-01    a 2008-06-21
b    2006-07-01    b 2006-06-25
     2007-07-01    b 2007-03-15
     2008-07-01    b 2008-06-21
Bharath
  • 26,906
  • 5
  • 45
  • 85
3

By using IntervalIndex(DF is your DataFrame)

idx=pd.IntervalIndex.from_arrays(pd.date_range(start='2005-07-01',freq='12MS',periods=12),pd.date_range(start='2006-06-30',freq='12M',periods=12),closed='both')
df=pd.DataFrame({'G':list(range(len(idx)))},index=idx)
DF.date=pd.to_datetime(DF.date)
DF['G']=df.loc[DF.date].values
DF.sort_values(['name','date']).drop_duplicates(['name','G'],keep='last')

Out[19]: 
   name       date  G
5     a 2006-06-25  0
12    a 2007-03-15  1
14    a 2008-06-21  2
21    b 2006-06-25  0
28    b 2007-03-15  1
30    b 2008-06-21  2
BENY
  • 258,262
  • 17
  • 121
  • 165
  • I have to say this is very easy to understand!Thanks to you. – ileadall42 Oct 21 '17 at 05:34
  • I do you know this is my first approach I used `beg = np.array([['{}/7/01'.format(i),'{}/6/30'.format(i+1)] for i in range(2005,2010)])` `index = pd.IntervalIndex.from_arrays(pd.to_datetime(beg[:,0]),pd.to_datetime(beg[:,1]))`. Yours is better – Bharath Oct 21 '17 at 05:35
  • @Tangfeifan Yw~ :-) – BENY Oct 21 '17 at 05:37
  • I knew grouper but not offset aliases. Its new to me and is fantastic – Bharath Oct 21 '17 at 05:38