I have the following code, which takes the key terms listed in 'job_titles' and uses them to filter a column called 'Jobtitle', keeping only the rows whose job title contains at least one of those terms.

The code was previously working, but it now raises this error:

ValueError: Cannot mask with non-boolean array containing NA / NaN values

Could anyone provide guidance on how to troubleshoot this?

Note that the dataframe is called glassdoor.

import re
import numpy as np

job_titles = ['data', 'analytics', 'machine learning']

# Creating masks for each job title to identify where they appear
job_masks = [glassdoor.Jobtitle.str.contains(Jobtitle, flags=re.IGNORECASE, regex=True) for Jobtitle in job_titles]
# Combining the masks: a row is True if any of the individual masks is True
combined_mask = np.vstack(job_masks).any(axis=0)
combined_mask

# Applying the mask to the dataset
glassdoor = glassdoor[combined_mask].reset_index(drop=True)
listings_after = glassdoor.shape[0]
print(f'After refining job titles there were {listings_after} job listings.')
glassdoor.head(20)

TC1111
  • Add `na=False` to `str.contains` call. – cs95 Sep 07 '20 at 23:22
  • Also, you can replace the loop with a single regex in the contains call: glassdoor.Jobtitle.str.contains('|'.join(job_titles), flags=re.IGNORECASE, regex=True) – BENY Sep 07 '20 at 23:23
  • See [this answer](https://stackoverflow.com/a/55335207/4909087) about searching with multiple terms; it also includes guidance on how to avoid the ValueError you're seeing. – cs95 Sep 07 '20 at 23:25
  • Thank you @cs95 - great response on the linked page – TC1111 Sep 07 '20 at 23:34
  • @TC1111 Thank you, please consider passing along an upvote if you found it useful. – cs95 Sep 07 '20 at 23:35
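
Putting the two suggestions from the comments together (a single joined pattern plus `na=False`), a minimal sketch might look like the following. The sample dataframe is hypothetical and only stands in for the real `glassdoor` data; the use of `re.escape` and `case=False` is an assumption made here for safety and brevity, not something taken from the comments. The likely cause of the error is that `str.contains` returns NaN for rows where 'Jobtitle' is missing, so the combined mask is no longer purely boolean.

```python
import re

import numpy as np
import pandas as pd

# Hypothetical sample data standing in for the real `glassdoor` dataframe;
# the NaN row mimics a listing with a missing job title.
glassdoor = pd.DataFrame(
    {
        "Jobtitle": [
            "Data Scientist",
            "Chef",
            np.nan,
            "Machine Learning Engineer",
            "Head of Analytics",
        ]
    }
)

job_titles = ["data", "analytics", "machine learning"]

# Build one pattern that matches any of the terms.
# re.escape (an addition here) guards against regex metacharacters in a term.
pattern = "|".join(re.escape(term) for term in job_titles)

# na=False treats missing titles as non-matches, so the mask holds only
# True/False values and can be used for boolean indexing.
combined_mask = glassdoor["Jobtitle"].str.contains(
    pattern, case=False, regex=True, na=False
)

filtered = glassdoor[combined_mask].reset_index(drop=True)
print(f"After refining job titles there were {filtered.shape[0]} job listings.")
```

With this sample data, the rows for "Chef" and the missing title are dropped. If rows with a missing title should be kept instead, `na=True` would be the corresponding choice.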
