
I have a csv that I'm loading into a dataframe. I only need the rows for which the Organization column contains a target string affiliation.

When I try to use str.contains() I get ValueError: cannot index with vector containing NA / NaN values.
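Here's a toy reproduction with made-up data of the kind of call that raises it (the exact wording of the ValueError varies a little between pandas versions):

import numpy as np
import pandas as pd

toy = pd.DataFrame({'Organization': ['Acme Corp', np.nan, 'Foo University']})
mask = toy['Organization'].str.contains('University')  # [False, NaN, True], object dtype
toy[mask]  # raises the ValueError because the mask still contains NaN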

I've looked at Value Error when Slicing in Pandas and pandas + dataframe - select by partial string, and the solution that both of them give works for me:

df = df[df['Organization'].str.contains(affiliation)==True]

or

df = df[df['Organization'].str.contains(affiliation).fillna(False)]

But, as a test, I did this:

print(len(df)) #99228
df = df[pd.notnull(df['Organization'])] #or df = df.dropna(subset=['Organization'])
print(len(df)) #99228
df = df[df['Organization'].str.contains(affiliation).fillna(False)]
print(len(df)) #1605

My question is: the ValueError I was getting without ==True or fillna(False) attached to str.contains() seems to imply that the Organization column has NaNs. But then why do I get the same sized df after keeping only the non-null Organization rows? What am I missing here?

Thanks!


2 Answers


Check the content of your Organization column. It very likely contains strings mixed with other data types. For those non-string values, df['Organization'].str.contains(affiliation) returns NaN, and you cannot index with NaN; you need to convert it to False first.
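For example (a small sketch with made-up values), a non-string entry such as a number is not null, so notnull()/dropna() keeps the row, but str.contains() maps it to NaN:

import pandas as pd

s = pd.Series(['Acme Corp', 12345, 'Foo University'])  # no NaN, but one non-string value
s.notnull().all()                            # True, so dropna(subset=...) removes nothing
s.str.contains('University')                 # [False, NaN, True] -> NaN for the non-string row
s.str.contains('University').fillna(False)   # [False, False, True], safe for indexing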


You need to specify str.contains('affiliation', na=False). [docs]

In [1]: import pandas as pd

In [2]: import numpy as np

In [3]: s = pd.Series(['foo','bar',np.nan,'this contains affiliation','baz',np.nan])

In [4]: s.str.contains('affiliation')
Out[4]:
0    False
1    False
2      NaN
3     True
4    False
5      NaN
dtype: object

In [5]: s.str.contains('affiliation', na=False)
Out[5]:
0    False
1    False
2    False
3     True
4    False
5    False
dtype: bool

Then you can index your DataFrame with that Boolean array.
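For example, continuing the session above, the na=False mask selects the matching rows directly:

In [6]: s[s.str.contains('affiliation', na=False)]
Out[6]:
3    this contains affiliation
dtype: object

The same pattern applies to the DataFrame in the question: df[df['Organization'].str.contains(affiliation, na=False)].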