1

I have the following problem: my dataframes look something like this:

ID Date        Value

1 2016-06-12   2
1 2016-06-13   2.5
1 2016-06-16   4
2 2016-06-12   3
2 2016-06-15   1.5

As you can see I have missing days in my data. So I much rather want something like this:

ID Date        Value

1 2016-06-12   2
1 2016-06-13   2.5
1 2016-06-14   NaN
1 2016-06-15   NaN
1 2016-06-16   4
2 2016-06-12   3
2 2016-06-13   NaN
2 2016-06-14   NaN
2 2016-06-15   1.5

In order to solve that I did the following:

df_new = df.groupby('ID').apply(lambda x: x.set_index('Date').resample('1D').first())

This solution works, but takes about half an hour to process a large dataset. Thus, I wanted to know whether if is there a better solution?

accdias
  • 3,827
  • 2
  • 15
  • 28
Salocin R.
  • 11
  • 1
  • Can you test my both solution with your real data and I am really curious if first solution is faster/better like second. Thanks. – jezrael Dec 13 '19 at 12:23

1 Answers1

0

First idea is create all posible combinations of ID and Date values with and then merge with left join:

from  itertools import product

df['Date'] = pd.to_datetime(df['Date'])

L = list(product(df['ID'].unique(), pd.date_range(df['Date'].min(), df['Date'].max())))

df = pd.DataFrame(L, columns=['ID','Date']).merge(df, how='left')
print (df)
   ID       Date  Value
0   1 2016-06-12    2.0
1   1 2016-06-13    2.5
2   1 2016-06-14    NaN
3   1 2016-06-15    NaN
4   1 2016-06-16    4.0
5   2 2016-06-12    3.0
6   2 2016-06-13    NaN
7   2 2016-06-14    NaN
8   2 2016-06-15    1.5
9   2 2016-06-16    NaN

Or use DataFrame.reindex, but performance should be worse, depends of data:

df['Date'] = pd.to_datetime(df['Date'])

mux = pd.MultiIndex.from_product([df['ID'].unique(), 
                                  pd.date_range(df['Date'].min(), df['Date'].max())],
                                  names=['ID','Date'])

df = df.set_index(['ID','Date']).reindex(mux).reset_index()
print (df)
   ID       Date  Value
0   1 2016-06-12    2.0
1   1 2016-06-13    2.5
2   1 2016-06-14    NaN
3   1 2016-06-15    NaN
4   1 2016-06-16    4.0
5   2 2016-06-12    3.0
6   2 2016-06-13    NaN
7   2 2016-06-14    NaN
8   2 2016-06-15    1.5
9   2 2016-06-16    NaN
jezrael
  • 629,482
  • 62
  • 918
  • 895
  • Sorry for taking so long. It appears that your first solution does not work for me, since the created list is so large that it leads to a memory error on my side (7500 days * 15000 firms). – Salocin R. Dec 13 '19 at 15:20
  • Will try the second solution tonight. – Salocin R. Dec 13 '19 at 15:21