-1

Say I have the following 2 pandas dataframes:

import pandas as pd

A = [174,-155,-931,301]
B = [943,847,510,16]
C = [325,914,501,884]
D = [-956,318,319,-83]

E = [767,814,43,-116]
F = [110,-784,-726,37]
G = [-41,964,-67,-207]
H = [-555,787,764,-788]

df1 = pd.DataFrame({"A": A, "B": B, "C": C, "D": D})
df2 = pd.DataFrame({"E": E, "B": F, "C": G, "D": H})

If I do concat with join=outer, I get the following resulting dataframe:

pd.concat([data1,data2], join='outer')

enter image description here

If I do df1.combine_first(df2), I get the following:

df1.set_index('B').combine_first(df2.set_index('B')).reset_index()

enter image description here

If I do pd.merge(df1, df2), I get the following which is identical to the result produced by concat:

pd.merge(data1, data2, on=['B','C','D'], how='outer')

enter image description here

And finally, if I do df1.join(df2, how='outer'), I get the following:

df1.join(df2, how='outer', on='B', lsuffix='_left', rsuffix='_right')

enter image description here

I don't fully understand how and why each produces different results.

Kristada673
  • 2,895
  • 3
  • 21
  • 59
  • 1
    Does this answer your question? [Difference(s) between merge() and concat() in pandas](https://stackoverflow.com/questions/38256104/differences-between-merge-and-concat-in-pandas) – LazyCoder Jun 08 '20 at 02:02
  • @LazyCoder Not fully, no. Because the question you posted only asks about `merge` vs `concat`, whereas my question also asks about `join` vs `combine_first` in addition to `merge` vs `concat`. – Kristada673 Jun 08 '20 at 02:05
  • You ask about multiple things. If you don't know what they are, why are you asking for "the difference"?--Find out what each is. If you think you know what they are, how is it that you need "the difference"--and what does that even mean? Either way give definitions, show & apply research & ask 1 clear specific non-duplicate question about how you are stuck on a specific point. [ask] – philipxy Jun 08 '20 at 06:45
  • Please [use text, not images/links, for text--including tables & ERDs](https://meta.stackoverflow.com/q/285551/3404097). Use images only for what cannot be expressed as text or to augment text. Include a legend/key & explanation with an image. – philipxy Jun 08 '20 at 17:46

1 Answers1

1

concat: append one dataframe to another along the given axis (default axix=0 meaning concat along index, i.e. put other dataframe below given dataframe). Data are aligned on the other axis (i.e. for default setting align columns). This is why we get NaNs in the non-matching columns 'A' and 'E'.

combine_first: replace NaNs in dataframe by existing values in other dataframe, where rows and columns are pooled (union of rows and cols from both dataframes). In your example, there are no missing values from the beginning but they emerge due to the union operation as your indices have no common entries. The order of the rows results from the sorted combined index (df1.B and df2.B).
So if there are no missing values in your dataframe you wouldn't normally use combine_first.

merge is a database-style combination of two dataframes that offers more options on how to merge (left, right, specific columns) than concat. In your example, the data of the result are identical, but there's a difference in the index between concat and merge: when merging on columns, the dataframe indices will be ignored and a new index will be created.

join merges df1 and df2 on the index of df1 and the given column (in the example 'B') of df2. In your example this is the same as pd.merge(df1, df2, left_on=df1.index, right_on='B', how='outer', suffixes=('_left', '_right')). As there's no match between the index of df1 and column 'B' of df2 there will be a lot of NaNs due to the outer join.

Stef
  • 17,395
  • 1
  • 14
  • 42