
I'm new to pandas and I would like to know how I can join two files and update existing rows based on a specific column. The files have thousands of lines. For example:

  • Df_1:

    A  B  C  D 
    1  2  5  4 
    2  2  6  8  
    9  2  2  1  
    

Now, my second table has exactly the same columns. I want to join the two tables, replacing the rows that appear in both tables but whose column C was changed/updated in table 2, and adding the new rows that only exist in this second table (df_2). For example:

  • Df_2:

    A  B  C  D 
    2  2  7  8  
    9  2  3  1
    3  4  6  7
    1  2  3  4  
    

So, the result I want is the union of the two tables, with a few rows updated in that specific column, like this:

  • Df_result:

    A  B  C  D 
    1  2  5  4 
    2  2  7  8  
    9  2  3  1  
    3  4  6  7
    1  2  3  4
    

How can I do this with the merge or concatenate function? Or is there another way to get the result I want?

Thank you!

MafaldaCA
    Try going through this. It may help now and in the future. https://stackoverflow.com/questions/53645882/pandas-merging-101 – wwnde May 22 '20 at 09:57

1 Answer


You need at least one column (or set of columns) to use as a reference key, i.e. something that identifies which rows need to be updated.

Assuming that in your case the key is "A" and "B":

import pandas as pd

# columns that identify a row
ref = ['A', 'B']
# stack the two frames, then keep only the df_2 version of each duplicated key
df_result = pd.concat([df_1, df_2], ignore_index=True)
df_result = df_result.drop_duplicates(subset=ref, keep='last')

Here is a complete example:

d = {'col1': [1, 2, 3], 'col2': ["a", "b", "c"], 'col3': ["aa", "bb", "cc"]}
df1 = pd.DataFrame(data=d)
d = {'col1': [1, 4, 5], 'col2': ["a", "d", "f"], 'col3': ["dd","ee", "ff"]}
df2 = pd.DataFrame(data=d)

df_result = pd.concat([df1, df2], ignore_index=True)

df_result = df_result.drop_duplicates(subset=['col1','col2'], keep='last')
df_result

This prints:

       col1 col2 col3
    1     2    b   bb
    2     3    c   cc
    3     1    a   dd
    4     4    d   ee
    5     5    f   ff
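Applying the same recipe to the frames from the question (df_1 and df_2 are rebuilt here so the snippet runs on its own):

```python
import pandas as pd

# the frames from the question
df_1 = pd.DataFrame({'A': [1, 2, 9], 'B': [2, 2, 2],
                     'C': [5, 6, 2], 'D': [4, 8, 1]})
df_2 = pd.DataFrame({'A': [2, 9, 3, 1], 'B': [2, 2, 4, 2],
                     'C': [7, 3, 6, 3], 'D': [8, 1, 7, 4]})

ref = ['A', 'B']
df_result = pd.concat([df_1, df_2], ignore_index=True)
df_result = df_result.drop_duplicates(subset=ref, keep='last').reset_index(drop=True)
print(df_result)
```

One caveat: the rows `1 2 5 4` (from df_1) and `1 2 3 4` (from df_2) share the same A/B key, so `keep='last'` keeps only the df_2 version. If both rows should survive, as in your desired output, the reference columns you choose must be able to tell those two rows apart.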

Rafael