0

I have two Dataframes which I want to merge on the column CCOL: I am giving a short sample as follows: ( the same name Cols have the same data in both dataframes)

 df1 = 

    CCOL     ColA    ColB   ColC   ColD     ColE     ColF

     A        1.0     a      b      23       45       2.7
     B        3.0     c      q      26       34       5.9
     C        5.0     f      r      2        4.9      15.9
     D        7.0     e      s      45       2         8


 df2 = 

    CCOL     ColA    ColB   ColE   ColP    ColR     ColS

      A      1.0     a       45     vq      4       21.7
      AF     7.0     za     7.9     r       4       3.1
      D      7.0     e       2      gh      3       53.1
      GK     8.0     ut     2       qt      4      33.3

So the final output from these two dataframes I am looking for is like:

  final_df = 
    
    CCOL  ColA ColB ColC ColD ColE ColF ColP ColR ColS
    
     A     1.0   a    b    23   45   2.7  vq   4   21.7
     D     7.0   e    s    45   2     8   gh   3   53.1

I tried

    newdf = pd.merge(df1,df2,on='CCOL', how='inner')

However, this one creates ColA_x, ColA_y etc in the new df from same columns, which I do not need.

My tries:,

   newdf = df1.combine_first(df2)

However, this one creates a dataframe like the follows which gives correct data for rows with CCOL, "A" & "D" but totally wrong for the columns which df1 & df2 does not match on CCOL.

     newdf = 
      
      CCOL  ColA ColB ColC ColD ColE ColF ColP ColR ColS
       A    1.0    a    b   23  45.0  2.7   vq  4.0 21.7
       B    3.0    c    q   26  34.0  5.9    r  4.0 3.1
       C    5.0    f    r   2   4.9   15.9  gh  3.0 53.1
       D    7.0    e    s   45  2.0   8.0   qt  4.0 33.3

So not understanding how to achieve my output final_df. Any help will be very much appreciated.

Stan
  • 580
  • 4
  • 18
  • Use `df1.merge(df2, on=['CCOL', 'ColA', 'ColB', 'ColE'])` – Space Impact Aug 12 '20 at 18:39
  • Does this answer your question? [Pandas Merging 101](https://stackoverflow.com/questions/53645882/pandas-merging-101) – Space Impact Aug 12 '20 at 18:39
  • It's a little unclear. The reason it's creating `'ColA_x'` and `'ColA_y'` is because without them being merge keys there's no guarantee they are the same (hence you'd need separate values). But if they are always supposed to be the same perhaps by design in prior operations, then you can literally just do `df1.merge(df2)` to get your desired output – ALollz Aug 12 '20 at 18:40
  • No none of these works to get me only CCOl rows with A & D and with all columns – Stan Aug 12 '20 at 18:42
  • @Stan If none of those are working, then you need to provide the data that actually represents your problem. If you need the correct answer. Also with your example, my first comment answer and ALollz suggestion both works. – Space Impact Aug 12 '20 at 18:46

2 Answers2

1

Almost there, just use how parameter as 'left' and add dropna:

df1.merge(df2,how='left').dropna()

Output:

  CCOL  ColA ColB ColC  ColD  ColE  ColF ColP  ColR  ColS
0    A   1.0    a    b    23  45.0   2.7   vq   4.0  21.7
3    D   7.0    e    s    45   2.0   8.0   gh   3.0  53.1
MrNobody33
  • 6,003
  • 3
  • 17
0

Okay so I assume you want to extend your result df on columns from both df1 and df2, basing on value from one particular column (present in both df1 & df2), which will identify rows you want to extend.

So there is no merging, but concatenation on different axis, like below in my example:

df1 = pd.DataFrame({'num_legs': [2, 4, 8, 0],
                   'num_wings': [2, 0, 0, 0],
                   'num_specimen_seen': [10, 2, 1, 8]},
                  index=['falcon', 'dog', 'spider', 'fish'])
        num_legs  num_wings  num_specimen_seen
falcon         2          2                 10
dog            4          0                  2
spider         8          0                  1
fish           0          0                  8
df2 = pd.DataFrame({'num_legs2': [2, 4, 3, 0],
                   'num_wings2': [2, 7, 77, 777],
                   'num_specimen_seen2': [10, 2, 1, 8]},
                  index=['falcon', 'dog', 'spider', 'fish'])
        num_legs2  num_wings2  num_specimen_seen2
falcon          2           2                  10
dog             4           7                   2
spider          3          77                   1
fish            0         777                   8
newdf = pd.concat([df1.num_legs.loc[df1.num_legs == df2.num_legs2],
                   df1.num_wings, df2.num_wings2],
                   axis=1)
print(newdf)

        num_legs  num_wings  num_wings2
falcon       2.0          2           2
dog          4.0          0           7
fish         0.0          0         777
spider       NaN          0          77

One row is excluded, where num_legs != num_legs2. Rest of the columns I wanted to concat were provided in concat method

Please see this link for details.

EDIT: There are still nans in my result. Add this

newdf = newdf[newdf2['num_legs'].notna()]

This answer is much better honestly :)

Kraxi
  • 53
  • 1
  • 10