184

I've done some searching and can figure out how to filter a dataframe by `df["col"].str.contains(word)`, however I'm wondering if there is a way to do the reverse: filter a dataframe by that set's complement, e.g. something to the effect of `!(df["col"].str.contains(word))`.

Can this be done through a DataFrame method?

stites

6 Answers

353

You can use the invert (~) operator (which acts like a not for boolean data):

new_df = df[~df["col"].str.contains(word)]

where `new_df` is the copy returned by the right-hand side.

contains also accepts a regular expression...
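For example, a minimal runnable sketch (the DataFrame and word here are invented for illustration):

import pandas as pd

df = pd.DataFrame({"col": ["apple pie", "banana bread", "cherry tart"]})
word = "banana"

# keep only the rows whose "col" does NOT contain the word
new_df = df[~df["col"].str.contains(word)]
print(new_df)  # the "apple pie" and "cherry tart" rows remain

# because contains treats its pattern as a regex by default,
# you can negate several alternatives at once:
df[~df["col"].str.contains("pie|tart")]  # only "banana bread" remains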


If the above throws a ValueError, it is likely because you have mixed datatypes in the column, so use na=False:

new_df = df[~df["col"].str.contains(word, na=False)]

Or,

new_df = df[df["col"].str.contains(word) == False]
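To illustrate the na=False case with a toy column (invented here): contains returns NaN for a missing value, ~NaN is undefined, and pandas refuses to index with a mask containing NA, so na=False marks those rows as non-matches instead:

import numpy as np
import pandas as pd

df = pd.DataFrame({"col": ["foo", np.nan, "foobar"]})

# df[~df["col"].str.contains("foo")]  # fails: the mask contains NaN
df[~df["col"].str.contains("foo", na=False)]  # keeps only the NaN row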
Andy Hayden
  • Perfect! I'm SQL-familiar with regex and thought it was different in Python - saw a lot of articles with `re.compile` and told myself I'd get to that later. Looks like I overfit the search and it's just as you say : ) – stites Jun 14 '13 at 14:58
  • Maybe a full example would be helpful: `df[~df.col.str.contains(word)]` returns a copy of the original dataframe with excluded rows matching the word. – Dennis Golomazov Jun 12 '17 at 18:03
60

I was having trouble with the not (~) symbol as well, so here's another way from another StackOverflow thread:

df[df["col"].str.contains('this|that')==False]
nanselm2
  • Can it be combined like this? `df[df["col1"].str.contains('this'|'that')==False and df["col2"].str.contains('foo'|'bar')==True]`? Thanks! – tommy.carstensen May 24 '17 at 10:54
  • Yes, you can. The syntax is explained here: https://stackoverflow.com/questions/22086116/how-do-you-filter-pandas-dataframes-by-multiple-columns – tommy.carstensen May 24 '17 at 11:26
  • Not to forget that if we want to remove rows which contain "|" we should escape it with "\", like `df = df[~df["col"].str.contains('\|')]` – Amir Nov 26 '19 at 08:59
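Following up on the combination asked about in the comments above: what actually works is regex alternation inside a single string plus `&` (not Python's `and`); a sketch with invented data:

import pandas as pd

df = pd.DataFrame({"col1": ["other", "this x"],
                   "col2": ["foo y", "bar z"]})

# col1 must contain neither 'this' nor 'that',
# AND col2 must contain 'foo' or 'bar'
df[(df["col1"].str.contains("this|that") == False)
   & (df["col2"].str.contains("foo|bar") == True)]  # keeps only the first row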
14

You can use apply with a lambda to select rows where the column's value is not one of the words in a list. For your scenario:

df[df["col"].apply(lambda x:x not in [word1,word2,word3])]
Arash
8

The answers above cover the basics; I am adding a framework to find multiple words and negate those rows from the DataFrame.

Here 'word1','word2','word3','word4' is the list of patterns to search for, df is the DataFrame, and column_a is a column name from the DataFrame df.

Search_for_These_values = ['word1','word2','word3','word4'] 

pattern = '|'.join(Search_for_These_values)

result = df.loc[~df['column_a'].str.contains(pattern, case=False)]
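If any of the search values can contain regex metacharacters (such as the '|' mentioned in a comment above), escaping them first keeps the match literal; a sketch building on the same names:

import re

Search_for_These_values = ['word1', 'word2', 'a|b']  # 'a|b' is meant literally
pattern = '|'.join(re.escape(w) for w in Search_for_These_values)
result = df.loc[~df['column_a'].str.contains(pattern, case=False)]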
Nursnaaz
7

I had to get rid of the NULL values before using the command recommended by Andy above. An example:

import pandas as pd

df = pd.DataFrame(index=[0, 1, 2], columns=['first', 'second', 'third'])
df.loc[:, 'first'] = 'myword'
df.loc[0, 'second'] = 'myword'
df.loc[2, 'second'] = 'myword'
df.loc[1, 'third'] = 'myword'
df

    first   second  third
0   myword  myword   NaN
1   myword  NaN      myword 
2   myword  myword   NaN

Now running the command:

~df["second"].str.contains(word)

I get the following error:

TypeError: bad operand type for unary ~: 'float'

I got rid of the NULL values using dropna() or fillna() first and retried the command with no problem.
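For completeness, here are both workarounds applied to the example above (word = 'myword' is assumed):

word = 'myword'

# option 1: drop the rows where 'second' is NaN first
clean = df.dropna(subset=['second'])
clean[~clean['second'].str.contains(word)]    # empty: both remaining rows match

# option 2: fill NaN with an empty string, which never matches
filled = df.fillna('')
filled[~filled['second'].str.contains(word)]  # keeps only row 1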

Shoresh
  • You can also use `~df["second"].astype(str).str.contains(word)` to force conversion to `str`. See https://stackoverflow.com/questions/43568760/pandas-drop-rows-where-column-contains – David C Jan 31 '18 at 21:07
  • @Shoresh we can also use `na=False` as a solution to this problem – Vishav Gupta Feb 13 '20 at 10:29
3

In addition to nanselm2's answer, you can use 0 instead of False:

df["col"].str.contains(word)==0
U11-Forward