Understanding how pandas join works

Question

Can somebody please explain this result to me? In particular, I don't know where the NaNs come from in the result. Also, I don't know how the join will decide what row to match with what row in this case.

left_df = pd.DataFrame.from_dict({'unique_l':[0, 1, 2, 3, 4], 'join':['a', 'a', 'b','b', 'c'] })
right_df = pd.DataFrame.from_dict({'unique_r':[10, 11, 12, 13, 14], 'join':['a', 'b', 'b','c', 'c'] })

join  unique_l
0    a         0
1    a         1
2    b         2
3    b         3
4    c         4 

  join  unique_r
0    a        10
1    b        11
2    b        12
3    c        13
4    c        14


print left_df.join(right_df, on='join', rsuffix='_r')

join  unique_l join_r  unique_r
0    a         0    NaN       NaN
1    a         1    NaN       NaN
2    b         2    NaN       NaN
3    b         3    NaN       NaN
4    c         4    NaN       NaN

In general, `join` is preferable for joining by the index, while `merge` is preferable for joining by a column. — IanS, Jan 04 '17 at 16:54

score 3 · Accepted Answer · edited May 23 '17 at 12:17

The join method makes use of indices. What you want is merge:

In [6]: left_df.merge(right_df, on="join", suffixes=("_l", "_r"))
Out[6]:
  join  unique_l  unique_r
0    a         0        10
1    a         1        10
2    b         2        11
3    b         2        12
4    b         3        11
5    b         3        12
6    c         4        13
7    c         4        14

Here is a related (but, IMO, not quite a duplicate) question that explains the difference between join and merge in more detail.

Understanding how pandas join works

1 Answers1