1

I have the following csv files:

file1.csv #dataframe is named dfFile1
Id,name,pos_neg,line
1,abc,pos,cas
2,cde,neg,work
3,efg,pos,cat
4,abc,pos,job

file2.csv #dataframe is named dfFile2
Id,ref,names,other
c10,n1,www,10.5
c11,m4,efg,5.4
c12,m5,cde,9.8
c13,m9,hhh,6.7
c14,n4,abc,12.5
c15,n9,kkk,3.4

which I converted into dataframes using pandas. I would like to obtain a third data frame that matches the rows of dfFile2 according to the unique values presented in the name field of dfFile1, and also add the pos_neg row from file 1, so I will end up with:

dfNew
Id,ref,names,other,pos_neg
c11,m4,efg,5.4,pos
c12,m5,cde,9.8,neg
c14,n4,abc,12.5,pos

So far, I have done the following:

list=[]
list=dfFile1["name"].unique()    #contains [abc,cde,efg]
dfFile2=dfFile2[dfFile2.names.isin(list)]

but I just do not know how can I merge the column pos_neg from dfFile1, I tried the following:

dfNew=dfFile2.merge(dfFile2,dfFil1[["pos_neg"]],on=dfFile2)

,but it does not work.

Any help?

Thanks

Little
  • 2,841
  • 6
  • 33
  • 57

3 Answers3

1

You were almost there, just some tweaking with the DataFrame.merge method, furthermore you need drop_duplicates here, since abc appears twice in the dfFile1.

dfNew = (
    dfFile2.merge(dfFile1[['name', 'pos_neg']], 
                  left_on='names', 
                  right_on='name')
    .drop_duplicates()
    .drop(columns='name')
)

    Id ref names  other pos_neg
0  c11  m4   efg    5.4     pos
1  c12  m5   cde    9.8     neg
2  c14  n4   abc   12.5     pos

Sidenote: in Python we don't use camelCase for variable names but lowercase with underscore camel_case. See PEP8 style guide:

Function names should be lowercase, with words separated by underscores as necessary to improve readability.

Erfan
  • 31,924
  • 5
  • 41
  • 51
  • the problem is that I need to validate that the names that appear on File1 also appear on File2, for that reason I need the unique() instruction – Little Jun 27 '20 at 20:52
  • Can you edit your expected output, else add more details to your input, because right now my output matches yours. – Erfan Jun 27 '20 at 20:53
  • thank you so much for your help, I found that in my original dataset file 1 can have duplicates like 1,abc,pos,cas and also 10,abc,neg,cow; and for that reason the merge was giving me a strange result based on your answer. This was happening because for a same name it could have two different values of pos and neg based on other factors.I will try to fix that case of file 1 first. – Little Jun 27 '20 at 22:32
1

You can iterate through your dataframe with iterrows

df3 = df2[df2.names.isin(names)]

for index, row in df3.iterrows():
    row = df[row['names'] == df['name']]['pos_neg']
    df3.loc[index,'pos_neg'] = row.iloc[0]

row.loc[0] stands for rows that has same 'name' field. Gets first of same named rows

Stannum
  • 36
  • 4
0

Try:

dfNew = dfFile2.merge(dfFile1[["name", "pos_neg"]], how="inner", left_on="names", right_on="name")

Rearranging the columns and/or renaming them shouldn't be difficult if above works.

NullDev
  • 13,073
  • 4
  • 44
  • 50
  • Thanks @NullDev, but I need to preserve the unique condition because on File1 there are repeated values on the name field – Little Jun 27 '20 at 20:49