3

I have two data frames with four and two columns. For example:

   A  B  C   D
0  4  2 320 700
1  5  7 400 800
2  2  6 300 550
3  4  6 100 300
4  5  2 250 360

and

   A  B  
0  2  4 
1  5  7
2  2  5 

I need to compare the first data frame with the second data frame and if column A and column B in the second data frame was in column A and column B in the first data frame. (order doesn't matter. it means in the first data frame in the first row A is 4, B is 2 and in the second data frame is A is 2 and B is 4 and it's not important but both numbers should be in the columns) keep the whole row in the first data frame; otherwise remove the row. so the output will be :

   A  B  C   D
0  4  2 320 700
1  5  7 400 800
2  5  2 250 360

How can I get this output (my actual data frames are so huge and can't iterate through them so need a fast efficient way)?

CryptoFool
  • 15,463
  • 4
  • 16
  • 36
Mohammad.sh
  • 147
  • 7

3 Answers3

2

I would do this by first sorting, then performing an LEFT OUTER JOIN using merge with an indicator to determine which rows to keep. Example,

u = df.loc[:, ['A', 'B']]
u.values.sort()     #  sort columns of `u` 
df2.values.sort()   #  sort columns of `df2`

df[u.merge(df2, how='left', indicator='ind').eval('ind == "both"').values]

   A  B    C    D
0  4  2  320  700
1  5  7  400  800
4  5  2  250  360

More info on joins with indicator can be found in my post: Pandas Merging 101


If you don't care about the final result being sorted or not, you can simplify this to an inner join.

df[['A', 'B']] = np.sort(df[['A', 'B']])
df2[:] = np.sort(df2)

df.merge(df2, on=['A', 'B'])

   A  B    C    D
0  2  4  320  700
1  5  7  400  800
2  2  5  250  360
cs95
  • 274,032
  • 76
  • 480
  • 537
  • A simple inner join would suffice here, or am I missing something? – Erfan Jun 09 '19 at 00:39
  • @Erfan it would not because the columns need to be sorted before merging, at the same time I don't want the final output to be sorted as well. – cs95 Jun 09 '19 at 00:40
  • @Erfan Edited for clarity. – cs95 Jun 09 '19 at 00:45
  • In the first solution, is it possible to just use the index of the inner join of `u` and `df2` for slicing? – GZ0 Jun 09 '19 at 00:48
  • @GZ0 A new index is assigned to the result if it is an inner join, so no, not as far as I know. – cs95 Jun 09 '19 at 00:54
  • @cs95 IC. Thanks. – GZ0 Jun 09 '19 at 01:06
  • @cs95 But left join cannot preserve index either. The solution breaks if the indices of `df` do not start with 0 or are non-consecutive numbers. – GZ0 Jun 09 '19 at 01:27
  • @GZ0 Indexes are irrelevant, left join preserves the order of the rows, I then use `eval` to create a boolean mask to index the original `df`. I'd suggest trying it out yourself. – cs95 Jun 09 '19 at 01:40
  • 2
    @cs95 I've tried and it breaks because the solution tries to align `df.index` with the index of the boolean mask. – GZ0 Jun 09 '19 at 01:41
  • @GZ0 Hmm, then `df[u.merge(df2, how='left', indicator='ind').eval('ind == "both"').values]` would work because that would use an array to subset. Thanks for pointing that out :) – cs95 Jun 09 '19 at 01:42
  • @cs95 OK that works. But it has to be based on the assumption that there are no duplicate rows in the sorted `u` (which could be the case for the OP's particular use scenario). Otherwise the left join would have a different size from `df`. – GZ0 Jun 09 '19 at 01:47
  • @GZ0 It's quite a leap to make without any context, but sure. Something like `df[['A', 'B']].apply(lambda x: tuple(sorted(x)), axis=1).isin([tuple(r) for r in np.sort(df2)])` would work (although quite wordy). Another option would involve dropping duplicates, merging, then somehow mapping that back to the original rows. Will wait until OP clarifies, don't really want to over-engineer this. – cs95 Jun 09 '19 at 01:52
  • Another idea: `df[np.isin(np.sort(df[['A', 'B']]), np.sort(df2)).all(axis=1)]` haven't tested it though. – cs95 Jun 09 '19 at 01:54
1

What I will do using frozenset + isin

yourdf=df[df[['A','B']].apply(frozenset,1).isin(df1.apply(frozenset,1))].copy()
   A  B    C    D
0  4  2  320  700
1  5  7  400  800
4  5  2  250  360
BENY
  • 258,262
  • 17
  • 121
  • 165
  • yeah, thought about this. it's a nice one liner but pretty slow. :p – cs95 Jun 09 '19 at 02:18
  • 1
    @cs95 yes, pretty slow :-) I totally agree :-) – BENY Jun 09 '19 at 02:19
  • 1
    @cs95 why is it slow? Please explain. The explanation will aid me in learning python better. Thanks. – mnm Jun 09 '19 at 03:54
  • @mnm Working with apply is generally slow, and working with objects is also slow because code cannot be vectorised. – cs95 Jun 09 '19 at 03:57
0

Using np.equal.outer

arr = np.equal.outer(df, df2)
df.loc[arr.any(1).all(-1).any(-1)]

Outputs

A   B   C   D
0   4   2   320 700
1   5   7   400 800
4   5   2   250 360
rafaelc
  • 48,227
  • 12
  • 46
  • 72