Setup
df = pd.DataFrame(dict(
A=['I need avocado', 'something', 'useless', 'nothing'],
B=['something', 'I eat margarina', 'eat apple', 'more nothing']
))
includeKeywords = ["apple", "avocado", "bannana"]
Problem
A B
0 I need avocado something # True 'avocado' in A
1 something I eat margarina
2 useless eat apple # True 'apple' in B
3 nothing more nothing
Solution
df[df.stack().str.contains('|'.join(includeKeywords)).any(level=0)]
A B
0 I need avocado something
2 useless eat apple
Details
This produces a regex
search string. In regex
, '|'
means or
. So for a regex
search, this says match 'apple'
, 'avocado'
, or 'bannana'
kwstr = '|'.join(includeKeywords)
print(kwstr)
apple|avocado|bannana
Stacking will flatten our DataFrame
df.stack()
0 A I need avocado
B something
1 A something
B I eat margarina
2 A useless
B eat apple
3 A nothing
B more nothing
dtype: object
Fortunately, the pandas.Series.str.contains
method can handle regex
and it will produce a boolean Series
df.stack().str.contains(kwstr)
0 A True
B False
1 A False
B False
2 A False
B True
3 A False
B False
dtype: bool
At which point we can cleverly use pandas.Series.any
by suggesting it only care about level=0
mask = df.stack().str.contains(kwstr).any(level=0)
mask
0 True
1 False
2 True
3 False
dtype: bool
By using level=0
we preserved the original index in the resulting Series
. This makes it perfect for filtering df
df[mask]
A B
0 I need avocado something
2 useless eat apple