3

I want to merge two big dataframes going like this:

            loc  val
2019-09-01  0    23.2
2019-09-02  0    13.2
...
2019-11-01  0    12.9
2019-09-01  1    21.2
2019-09-01  1    26.7
...
2019-11-01  1    13.5
...
2019-09-01  4    23.4
...
2019-11-01  4    17.8

so, in other words, as index I have many datetimes for each loc, with loc running from 0 to 4.

I have 2 of these dataframes. I want to join them by the loc column BUT at the same time I want to take the indexes into consideration in an inner way. So if I have as a 2nd dataframe:

            loc  val
2019-09-02  0    54.8
2019-09-03  0    11.7
...

so the merge will be something like:

            loc  val    val
2019-09-01  0    23.2   NaN
2019-09-02  0    13.2   54.8
...

Do you know if this is possible? I would like something like this (it were possible):

df = pd.merge(df1, df2, on="loc", left_index=True, right_index=True)

I have been testing with merge but I don't know how to do it. Thanks.

David
  • 857
  • 1
  • 9
  • 25

2 Answers2

3

IIUC,

We can rename the axis to have a common index name, I tried playing around with merging on a blank index but I couldn't figure it out,

we then merge on your 'loc' column + newly named 'date' index.

you sound like you know your merges so change the behavior to match your requirement.

df.rename_axis('date',inplace=True)
df1.rename_axis('date',inplace=True)
pd.merge(df,df1,on=['loc','date'],how='left',indicator=True)
out:


           loc  val_x  val_y     _merge
date                                    
2019-09-01  0.0   23.2    NaN  left_only
2019-09-02  0.0   13.2   54.8       both
2019-11-01  0.0   12.9    NaN  left_only
2019-09-01  1.0   21.2    NaN  left_only
2019-09-01  1.0   26.7    NaN  left_only
2019-11-01  1.0   13.5    NaN  left_only
2019-09-01  4.0   23.4    NaN  left_only
2019-11-01  4.0   17.8    NaN  left_only
Umar.H
  • 18,427
  • 4
  • 26
  • 52
2

You can try with this:

df_1 = df_1.reset_index().rename(columns={'index':'dates'}) #Creates columns from the index, and then rename it to `dates`
df_2 = df_2.reset_index().rename(columns={'index':'dates'}) #Same as first line

df_output = df_1.merge(df_2,how='inner',left_on=['loc','dates'],right_on=['loc','dates']) #Finally perform the inner join based on both columns.

And this would result in the desired output. I'm working on creating the sample set to illustrate it better.

import pandas as pd
d_1 = {'index':['2019-09-02','2019-09-03'],'loc':[0,0],'val':[23.2,13.2]}
d_2 = {'index':['2019-09-02','2019-09-03','2019-09-05'],'loc':[0,0,0],'val':[54.8,10,13]}
df_1 = pd.DataFrame(d_1)
df_2 = pd.DataFrame(d_2)
df_1 = df_1.set_index('index') #This is your data
df_2 = df_2.set_index('index') #This is your data
print(df_1)
print(df_2)
df_1 = df_1.reset_index().rename(columns={'index':'dates'})
df_2 = df_2.reset_index().rename(columns={'index':'dates'})

final_df = df_2.merge(df_1,how='inner',left_on=['dates','loc'],right_on=['dates','loc'])
print(final_df)

This is the output:

        dates  loc  val_x  val_y
0  2019-09-02    0   54.8   23.2
1  2019-09-03    0   10.0   13.2

However:

For your expected output and given the information you provided, a left join will match the requierement easier. With this information:

d_1 = {'index':['2019-09-01','2019-09-02'],'loc':[0,0],'val':[23.2,13.2]}
d_2 = {'index':['2019-09-02','2019-09-03'],'loc':[0,0],'val':[54.8,11.7]}
final_df = df_2.merge(df_1,how='left',left_on=['dates','loc'],right_on=['dates','loc'])
print(final_df)

Output:

        dates  loc  val_x  val_y
0  2019-09-02    0   54.8   13.2
1  2019-09-03    0   11.7    NaN
Celius Stingher
  • 11,967
  • 4
  • 12
  • 37
  • 2
    First, your use of `reset_index` -> `rename` is overly complicated when you can just use `rename_axis` as Datanovice did. Second, you don't need to specify both `left_on` and `right_on` when they are the same. Just use `on`. Third, `how='left'` would be correct if you used the `df_1` as your left frame. Your result does not match OP. Even so, this post is useful so plus_one from me. – piRSquared Nov 14 '19 at 17:03