
I have the following dataframe:

import pandas as pd

df = pd.DataFrame({
    'Code': ['XAW', 'PAK', 'I', 'QP', 'TOPZ', 'XAW', 'APOL'],
    'Name': ['George Truck', 'Fred Williams', 'Jessica Weir', 'Tony P.', 'John Truck', 'Liz Moama', 'Emily Truck'],
    'Color': ['Blue', 'Green', 'Green', 'Red', 'Pink', 'Blue', 'Pink']
})


    Code    Name           Color
0   XAW     George Truck   Blue
1   PAK     Fred Williams  Green
2   I       Jessica Weir   Green
3   QP      Tony P.        Red
4   TOPZ    John Truck     Pink
5   XAW     Liz Moama      Blue
6   APOL    Emily Truck    Pink

Given a keyword, such as 'blue', I would like to retrieve the following rows:

0   XAW     George Truck   Blue
5   XAW     Liz Moama      Blue

The search can contain multiple keywords, for example, 'truck pink' would return:

4   TOPZ    John Truck     Pink
6   APOL    Emily Truck    Pink

Imagine that this dataframe has half a million rows and a few extra columns. Is there a fast way I can query the entire dataframe for specific keywords?

user270199
  • Why not concatenate the columns you want to look at and then just do `Series.str.contains`? – ifly6 Sep 09 '20 at 19:45
  • @ifly6 I might want to query for say just code APOL and not include rows that contain names where APOL is a substring. Perhaps some fancy regex would do this? – user270199 Sep 09 '20 at 19:52
  • 1
    Does this answer your question? [Select by partial string from a pandas DataFrame](https://stackoverflow.com/questions/11350770/select-by-partial-string-from-a-pandas-dataframe) or [this one](https://stackoverflow.com/questions/47181187/finding-string-over-multiple-columns-in-pandas) – MattR Sep 09 '20 at 19:55
  • If you don't care about substrings you could look at word boundaries: https://www.rexegg.com/regex-boundaries.html#wordboundary e.g. `\bgeorge\b.+\bblue\b` matches line 0 – Jonas Geiregat Sep 09 '20 at 20:02
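The word-boundary idea from the last two comments can be sketched like this; the `\b` anchors are plain regex (nothing pandas-specific), and they keep a keyword such as `apol` from matching inside a longer word:

```python
import pandas as pd

df = pd.DataFrame({
    'Code': ['XAW', 'PAK', 'I', 'QP', 'TOPZ', 'XAW', 'APOL'],
    'Name': ['George Truck', 'Fred Williams', 'Jessica Weir', 'Tony P.', 'John Truck', 'Liz Moama', 'Emily Truck'],
    'Color': ['Blue', 'Green', 'Green', 'Red', 'Pink', 'Blue', 'Pink']
})

# Concatenate the searchable columns into a single lower-cased series.
haystack = (df['Code'] + ' ' + df['Name'] + ' ' + df['Color']).str.lower()

# \b anchors the pattern at word boundaries, so 'apol' only matches the
# standalone token APOL, not a name that merely contains it as a substring.
mask = haystack.str.contains(r'\bapol\b', regex=True, na=False)
print(df[mask])
```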

1 Answer


With search string s = 'truck pink', set up a search column:

t = (df['Name'] + ' ' + df['Color']).str.lower()

I force everything to lower case because your search example doesn't seem to be case-sensitive. If you have dynamic search inputs, lower-case those as well. Then run a contains check for each keyword:

d = {}
for i in s.split(' '):
    d[i] = t.str.contains(i, na=False)

I pass na=False because otherwise str.contains returns NaN for rows where the string itself is NaN, and a mask containing NaN can't be used for boolean indexing; we want those rows treated as non-matches. The cost grows linearly with the number of search terms, since each term is a full scan of the column. Also consider changing this if you want to match whole words only, because contains matches sub-strings.

Regardless, take the results and reduce them with bitwise AND. You need two imports:

from functools import reduce
from operator import and_

df[reduce(and_, d.values())]

And thus:

   Code         Name Color
4  TOPZ   John Truck  Pink
6  APOL  Emily Truck  Pink
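Putting the pieces above together, a self-contained version might look like this (the name `keyword_search` and the choice of columns are mine, not part of the original code):

```python
import pandas as pd
from functools import reduce
from operator import and_

df = pd.DataFrame({
    'Code': ['XAW', 'PAK', 'I', 'QP', 'TOPZ', 'XAW', 'APOL'],
    'Name': ['George Truck', 'Fred Williams', 'Jessica Weir', 'Tony P.', 'John Truck', 'Liz Moama', 'Emily Truck'],
    'Color': ['Blue', 'Green', 'Green', 'Red', 'Pink', 'Blue', 'Pink']
})

def keyword_search(df, s, columns=('Name', 'Color')):
    """Return the rows where every whitespace-separated keyword in s occurs."""
    # Join the searchable columns into one lower-cased string per row.
    t = df[list(columns)].astype(str).agg(' '.join, axis=1).str.lower()
    # One boolean mask per keyword; regex=False treats keywords literally.
    masks = (t.str.contains(kw, na=False, regex=False) for kw in s.lower().split())
    # AND the masks together so a row must contain all keywords.
    return df[reduce(and_, masks)]

print(keyword_search(df, 'truck pink'))
```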
ifly6
  • Is there a way to combine .str.contains with another method that takes for example s = 'trk' and returns all rows with 'truck' in them, i.e. a partial match? – user270199 Sep 09 '20 at 20:51
  • You're looking for fuzzy matching, that's much more difficult (computationally intensive). – ifly6 Sep 09 '20 at 20:58
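As a very rough sketch of what fuzzy matching could look like with the standard library's difflib (the 0.6 cutoff is arbitrary, and a per-row Python loop like this will be slow at half a million rows):

```python
import difflib
import pandas as pd

df = pd.DataFrame({
    'Code': ['XAW', 'PAK', 'I', 'QP', 'TOPZ', 'XAW', 'APOL'],
    'Name': ['George Truck', 'Fred Williams', 'Jessica Weir', 'Tony P.', 'John Truck', 'Liz Moama', 'Emily Truck'],
    'Color': ['Blue', 'Green', 'Green', 'Red', 'Pink', 'Blue', 'Pink']
})

def fuzzy_mask(series, keyword, cutoff=0.6):
    """True where any word in the row is 'close enough' to the keyword."""
    kw = keyword.lower()
    return series.str.lower().apply(
        # get_close_matches returns [] when no word clears the cutoff.
        lambda row: bool(difflib.get_close_matches(kw, row.split(), n=1, cutoff=cutoff))
    )

t = df['Name'] + ' ' + df['Color']
# 'trk' has no exact match anywhere, but scores well against 'truck'.
print(df[fuzzy_mask(t, 'trk')])
```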