90

I have 2 dataframes:

restaurant_ids_dataframe

Data columns (total 13 columns):
business_id      4503  non-null values
categories       4503  non-null values
city             4503  non-null values
full_address     4503  non-null values
latitude         4503  non-null values
longitude        4503  non-null values
name             4503  non-null values
neighborhoods    4503  non-null values
open             4503  non-null values
review_count     4503  non-null values
stars            4503  non-null values
state            4503  non-null values
type             4503  non-null values
dtypes: bool(1), float64(3), int64(1), object(8)`

and

restaurant_review_frame

Int64Index: 158430 entries, 0 to 229905
Data columns (total 8 columns):
business_id    158430  non-null values
date           158430  non-null values
review_id      158430  non-null values
stars          158430  non-null values
text           158430  non-null values
type           158430  non-null values
user_id        158430  non-null values
votes          158430  non-null values
dtypes: int64(1), object(7)

I would like to join these two DataFrames to make them into a single dataframe using the DataFrame.join() command in pandas.

I have tried the following line of code:

#the following line of code creates a left join of restaurant_ids_frame and   restaurant_review_frame on the column 'business_id'
restaurant_review_frame.join(other=restaurant_ids_dataframe,on='business_id',how='left')

But when I try this I get the following error:

Exception: columns overlap: Index([business_id, stars, type], dtype=object)

I am very new to pandas and have no clue what I am doing wrong as far as executing the join statement is concerned.

any help would be much appreciated.

smci
  • 26,085
  • 16
  • 96
  • 138
anonuser0428
  • 8,987
  • 18
  • 55
  • 81
  • Related, broader topic on pandas merging in general: [Pandas Merging 101](https://stackoverflow.com/questions/53645882/pandas-merging-101). – cs95 Jul 14 '20 at 10:49

3 Answers3

131

You can use merge to combine two dataframes into one:

import pandas as pd
pd.merge(restaurant_ids_dataframe, restaurant_review_frame, on='business_id', how='outer')

where on specifies field name that exists in both dataframes to join on, and how defines whether its inner/outer/left/right join, with outer using 'union of keys from both frames (SQL: full outer join).' Since you have 'star' column in both dataframes, this by default will create two columns star_x and star_y in the combined dataframe. As @DanAllan mentioned for the join method, you can modify the suffixes for merge by passing it as a kwarg. Default is suffixes=('_x', '_y'). if you wanted to do something like star_restaurant_id and star_restaurant_review, you can do:

 pd.merge(restaurant_ids_dataframe, restaurant_review_frame, on='business_id', how='outer', suffixes=('_restaurant_id', '_restaurant_review'))

The parameters are explained in detail in this link.

rocarvaj
  • 446
  • 3
  • 16
mlimb
  • 1,553
  • 2
  • 11
  • 9
  • 1
    Your advice solved my problem. The only change I had to make was I did an inner merge instead of outer. i.e. how='inner' instead of outer. Thanks for your help. – anonuser0428 Sep 15 '13 at 16:56
  • 2
    how=inner|outer|left|right, how to merge, intersection of keys left and right|union(ALL) keys left and right|left keys only|right keys only| – gaoithe Oct 11 '15 at 21:05
24

Joining fails if the DataFrames have some column names in common. The simplest way around it is to include an lsuffix or rsuffix keyword like so:

restaurant_review_frame.join(restaurant_ids_dataframe, on='business_id', how='left', lsuffix="_review")

This way, the columns have distinct names. The documentation addresses this very problem.

Or, you could get around this by simply deleting the offending columns before you join. If, for example, the stars in restaurant_ids_dataframe are redundant to the stars in restaurant_review_frame, you could del restaurant_ids_dataframe['stars'].

Dan Allan
  • 29,547
  • 6
  • 63
  • 61
  • it also says 'business_id' column overlaps, isn't it supposed to overlap since that's the column I'm creating the join on? How do I address that issue? – anonuser0428 Sep 13 '13 at 23:05
  • Hey @DanAllan I tried the join method but all I get is 4503 entries in the restaurant_ids_dataframe and zero entries in the columns belonging to the restaurant_review_frame. Could you please let me know why this is happening? I have performed a left join as you suggested using your above statement but it doesn't seem to give me any items from the restaurant_review_frame for some reason. What I am looking for is to create a dataframe with all the columns from both the dataframes,joined on business_id. I also deleted the extra columns other than business_id. – anonuser0428 Sep 13 '13 at 23:18
  • If you are still interested in resolving this, please provide example that data reproduces the problem. – Dan Allan Sep 16 '13 at 15:18
17

In case anyone needs to try and merge two dataframes together on the index (instead of another column), this also works!

T1 and T2 are dataframes that have the same indices

import pandas as pd
T1 = pd.merge(T1, T2, on=T1.index, how='outer')

P.S. I had to use merge because append would fill NaNs in unnecessarily.

Firas
  • 378
  • 2
  • 9