93

These are my two dataframes saved in two variables:

> print(df.head())
>
          club_name  tr_jan  tr_dec  year
    0  ADO Den Haag    1368    1422  2010
    1  ADO Den Haag    1455    1477  2011
    2  ADO Den Haag    1461    1443  2012
    3  ADO Den Haag    1437    1383  2013
    4  ADO Den Haag    1386    1422  2014
> print(ranking_df.head())
>
           club_name  ranking  year
    0    ADO Den Haag    12    2010
    1    ADO Den Haag    13    2011
    2    ADO Den Haag    11    2012
    3    ADO Den Haag    14    2013
    4    ADO Den Haag    17    2014

I'm trying to merge these two using this code:

new_df = df.merge(ranking_df, on=['club_name', 'year'], how='left')

The how='left' is added because I have fewer data points in my ranking_df than in my standard df.

The expected behaviour is as such:

> print(new_df.head())
>
          club_name  tr_jan  tr_dec  year  ranking
    0  ADO Den Haag    1368    1422  2010       12
    1  ADO Den Haag    1455    1477  2011       13
    2  ADO Den Haag    1461    1443  2012       11
    3  ADO Den Haag    1437    1383  2013       14
    4  ADO Den Haag    1386    1422  2014       17

But I get this error:

ValueError: You are trying to merge on object and int64 columns. If you wish to proceed you should use pd.concat

But I do not wish to use concat, since I want to merge the dataframes on their keys, not just append one to the other.

Another behaviour that's weird in my mind is that my code works if I save the first df to .csv and then load that .csv into a dataframe.

The code for that:

df = pd.DataFrame(data_points, columns=['club_name', 'tr_jan', 'tr_dec', 'year'])
df.to_csv('preliminary.csv')

df = pd.read_csv('preliminary.csv', index_col=0)

ranking_df = pd.DataFrame(rankings, columns=['club_name', 'ranking', 'year'])

new_df = df.merge(ranking_df, on=['club_name', 'year'], how='left')

I think it has to do with the index_col=0 parameter, but I have no idea how to fix it without saving the dataframe first. It doesn't matter much, but it is an annoyance that I have to do that.

Acumenus
PEREZje

7 Answers

127

In one of your dataframes the year is a string, and in the other it is an int64. Convert it first and then merge, e.g. df['year'] = df['year'].astype(int) or, as RafaelC suggested, df.year.astype(int) (assign the result back to the column in either case).
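A minimal sketch of that conversion, assuming `year` ended up as strings in `df` and as integers in `ranking_df`. The two rows below are copied from the question for illustration only; which side actually holds the strings is an assumption.

```python
import pandas as pd

# Illustrative data: 'year' is a string in df but an integer in ranking_df.
df = pd.DataFrame({
    "club_name": ["ADO Den Haag", "ADO Den Haag"],
    "tr_jan": [1368, 1455],
    "tr_dec": [1422, 1477],
    "year": ["2010", "2011"],
})
ranking_df = pd.DataFrame({
    "club_name": ["ADO Den Haag", "ADO Den Haag"],
    "ranking": [12, 13],
    "year": [2010, 2011],
})

# Inspect the key dtypes first; a mismatch here is what triggers the ValueError.
print(df["year"].dtype, ranking_df["year"].dtype)

# Convert the string column to integers and assign it back, then merge on both keys.
df["year"] = df["year"].astype(int)
new_df = df.merge(ranking_df, on=["club_name", "year"], how="left")
print(new_df)
```

Printing the dtypes before merging is a cheap way to see which side needs converting.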

Edit: Also note the comment by Anderson Zhu: if one of your dataframes has None or missing values in that column, you need to use Int64 instead of int. See the reference here: https://pandas.pydata.org/pandas-docs/stable/user_guide/integer_na.html
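A small sketch of the missing-value case, with made-up data. Going through `pd.to_numeric` first is my own choice here, not something from the comment; it is one way to get from strings to the nullable `Int64` dtype.

```python
import pandas as pd

# Illustrative data: one year is missing, so a plain astype(int) would fail.
df = pd.DataFrame({
    "club_name": ["ADO Den Haag", "ADO Den Haag", "ADO Den Haag"],
    "year": ["2010", None, "2012"],
})

# Parse the strings to numbers first (the missing value becomes NaN),
# then cast to the nullable integer dtype "Int64" (capital I).
df["year"] = pd.to_numeric(df["year"]).astype("Int64")
print(df["year"])
```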

Arnon Rotem-Gal-Oz
  • Thanks it worked. Kinda weird since I saved every year as ints. – PEREZje Jun 01 '18 at 19:40
  • 13
    why not `df.year.astype(int)`? – rafaelc Jun 01 '18 at 19:49
  • I did eventually fix it in another way, just saved all year variables into the data-frame as integers. Never figured they were strings. – PEREZje Jun 01 '18 at 19:53
  • @RafaelC that's probably better – Arnon Rotem-Gal-Oz Jun 01 '18 at 20:36
  • Hi I used below code import pandas as pd df1 = pd.read_excel(r" ") df2 = pd.read_excel(r" ") df2['changerequestnumber']=df2['changerequestnumber'].astype(int) f9=df1.merge(df3, left_on='Related CRs', right_on='changerequestnumber') f9.to_excel(r" ") but still i am getting ValueError: You are trying to merge on object and int32 columns. If you wish to proceed you should use pd.concat can you help me on this – tiru Jul 22 '19 at 05:56
  • 2
    Just in case you have None or missing values in one of your dataframes, you need to use `Int64` instead of `int`. See the reference [here](https://pandas.pydata.org/pandas-docs/stable/user_guide/integer_na.html). – andersonzhu Sep 18 '20 at 16:59
53

I found that my dfs both had the same type column (str) but switching from join to merge solved the issue.
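A sketch of why this can happen, using made-up data (the second club and its numbers are invented for illustration). As far as I understand it, `join` matches the left key column against the index of the other dataframe, while `merge` matches column against column.

```python
import pandas as pd

# Illustrative data: the key column is a string in both dataframes.
left = pd.DataFrame({"club_name": ["ADO Den Haag", "Ajax"], "tr_jan": [1368, 1500]})
right = pd.DataFrame({"club_name": ["ADO Den Haag", "Ajax"], "ranking": [12, 1]})

# join() matches left["club_name"] against right's index (0, 1, ...),
# so strings are compared with integers and pandas raises a ValueError.
try:
    left.join(right, on="club_name", rsuffix="_right")
    join_failed = False
except ValueError as exc:
    join_failed = True
    print("join failed:", exc)

# merge() matches the two key columns directly.
merged = left.merge(right, on="club_name")

# join() also works once the key column is made the index of the right frame.
joined = left.join(right.set_index("club_name"), on="club_name")
print(merged)
print(joined)
```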

Alex Moore-Niemi
  • 8
    Same here. If somebody knows why, please write below :) – raummensch Jun 07 '20 at 09:55
  • Same. Very odd indeed, my only guess is that even if everything is of type `object`, when doing the join pandas tries to evaluate data types once more implicitly... But merge solved it for me too. – 15Step Jul 16 '20 at 15:38
  • 12
    @raummensch and @15Step, I had the same problem. The reason why merge works on strings but join doesn't can be found in the answer by @MatthiasFripp here: [link](https://stackoverflow.com/questions/22676081/what-is-the-difference-between-join-and-merge-in-pandas). Basically `df1.join(df2)` always merges via the index of `df2`, whereas `df1.merge(df2)` will merge on the column. So we were trying to merge a string with an integer index, even though both columns were strings. – Nicko Aug 15 '20 at 18:37
  • 1
    This happened with me too. Thanks for telling us. – igorkf Jan 15 '21 at 15:33
3

It happens when the common column has a different data type in each table.

Example: in table1 you have date as a string, whereas in table2 you have date as a datetime. So before merging, we need to convert date to a common data type.
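A minimal sketch of that date example. The table contents and the column names `value` and `other` are made up for illustration.

```python
import pandas as pd

# Illustrative data: 'date' is a string in table1 and a datetime in table2.
table1 = pd.DataFrame({"date": ["2018-06-01", "2018-06-02"], "value": [1, 2]})
table2 = pd.DataFrame({
    "date": pd.to_datetime(["2018-06-01", "2018-06-02"]),
    "other": [10, 20],
})

# Bring both key columns to the same dtype before merging.
table1["date"] = pd.to_datetime(table1["date"])
merged = table1.merge(table2, on="date", how="left")
print(merged)
```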

Ashish Anand
3

@Arnon Rotem-Gal-Oz's answer is right for the most part, but I would like to point out the difference between df['year'] = df['year'].astype(int) and df.year.astype(int). df.year.astype(int) returns a new Series and does not change the dtype of the column in the dataframe, at least in pandas 0.24.2. df['year'] = df['year'].astype(int) does change the dtype, because it assigns the result back to the column. I would argue that this is the safest way to permanently change the dtype of a column.

Example:

    df = pd.DataFrame({'Weed': ['green crack', 'northern lights', 'girl scout cookies'], 'Qty': [10, 15, 3]})
    df.dtypes

Weed object, Qty int64

    df['Qty'].astype(str)
    df.dtypes

Weed object, Qty int64

Passing inplace=True does not help either. astype has no inplace parameter: in pandas 0.24.2 the extra keyword is accepted but has no effect, and recent pandas versions raise a TypeError for it.

    df['Qty'].astype(str, inplace=True)
    df.dtypes

Weed object, Qty int64

Now the assignment:

    df['Qty'] = df['Qty'].astype(str)
    df.dtypes

Weed object, Qty object

escha
1

Additional: a .csv file stores every value as text, and the dtype is inferred again when the file is read back, so a column (year in this specific case) can come back with a different dtype than it has in a dataframe built in memory. That is why the merge works when both dataframes are loaded from .csv files, while the error above shows up if one dataframe is loaded from a .csv file and the other is an existing dataframe. This is somewhat annoying, but it has an easy solution once you know about it: convert the column to a common type before merging.
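The round trip from the question can be sketched like this, assuming the years were collected as strings. An in-memory buffer stands in for the 'preliminary.csv' file, and the single row is illustrative.

```python
import io
import pandas as pd

# Illustrative data: the year was collected as a string.
df = pd.DataFrame({"club_name": ["ADO Den Haag"], "year": ["2010"]})
dtype_before = df["year"].dtype

# Round-trip through CSV using an in-memory buffer instead of a file on disk.
buffer = io.StringIO()
df.to_csv(buffer)
buffer.seek(0)
reloaded = pd.read_csv(buffer, index_col=0)

# read_csv infers the dtype from the text, so "2010" comes back as an integer.
print(dtype_before, "->", reloaded["year"].dtype)
```

This would explain why saving and reloading made the merge work: the reload quietly converts the column, which `astype(int)` does without the detour through a file.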

CathyQian
0

First check the dtypes of the columns you want to merge on. You will see that one of them is a string while the other is an int. Then convert the string column to int with the following code:

df["something"] = df["something"].astype(int)

merged = df.merge(df1, on="something")
0

This simple solution works for me:

    final = pd.concat([df, ranking_df], axis=1, sort=False)

But you may need to drop some duplicate columns first.
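One caveat worth checking before relying on this: with axis=1, concat pairs rows by index label, not by club_name and year. The sketch below uses made-up data in which the rows of ranking_df are deliberately in a different order, to show how the two results can differ.

```python
import pandas as pd

# Illustrative data: the rows of ranking_df are in a different order than df.
df = pd.DataFrame({
    "club_name": ["ADO Den Haag", "ADO Den Haag"],
    "year": [2010, 2011],
})
ranking_df = pd.DataFrame({
    "club_name": ["ADO Den Haag", "ADO Den Haag"],
    "ranking": [13, 12],
    "year": [2011, 2010],
})

# concat with axis=1 pairs rows by index label, ignoring club_name and year,
# and keeps both copies of the shared columns.
side_by_side = pd.concat([df, ranking_df], axis=1, sort=False)

# merge pairs rows by the key columns instead.
merged = df.merge(ranking_df, on=["club_name", "year"], how="left")

print(side_by_side)
print(merged)
```

So concat only gives the merged result when both dataframes already have the same rows in the same order.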