search strings and label search terms in pandas efficiently

Question

I have about 6millions rows and 500 search terms. For each search term I want to find out which of these 6million rows partially matches with it. There are multiple matches for each search term too.

My data looks like this

import pandas as pd
searchterms = ['apple', 'orange', 'pear']
tosearch = {'ID':[1,2,3,4,5,6],'Item':['aassseapplekk', 'orangee', 'peear', 'banana', 'appleorange', 'fruitpear']}
df = pd.DataFrame(tosearch)
print(df)

I want to know which search terms returns the matches, and return something like this

ideal =  {'ID':[1,2,5,5,6],'Item':['aassseapplekk', 'orangee', 'appleorange', 'appleorange', 'fruitpear'], 'searchterms':['apple', 'orange', 'apple', 'orange', 'pear']}
ideal_df = pd.DataFrame(ideal)
print(ideal_df)

I've tried the following - not sure how it scales up to millions records and any ideas how I can find out which search item returns the result?

Trial 1: use regex pipe to search all search items

pattern = '|'.join(searchterms)
mock = df[df['Item'].str.contains(pattern)]

Trial 2: loop to append search results of each term

I'm still unsure about this loop since I don't seem to be able to substitute each item in my search list in 'i' and append to the empty list afterwards.

match = []
for i in searchterms:
    searchterm_match = df[df['Item'].str.contains(r'i', regex=False)]
    match.append(searchterm_match)

This [Answer](https://stackoverflow.com/questions/11350770/select-by-partial-string-from-a-pandas-dataframe) might be useful. — Akash senta, Jun 23 '20 at 16:52

score 1 · Accepted Answer · answered Jun 23 '20 at 16:51

1

You can use extractall to extract the pattern and then merge the result back to the df

pat = '('+'|'.join(searchterms)+')'
df.merge(df.groupby('ID')['Item'].apply(lambda x: x.str.extractall((pat))).reset_index([1,2], drop = True)[0].reset_index(name = 'searchterms'))


    ID  Item            searchterms
0   1   aassseapplekk   apple
1   2   orangee         orange
2   5   appleorange     apple
3   5   appleorange     orange
4   6   fruitpear       pear

answered Jun 23 '20 at 16:51

Vaishali

32,439
4
39
71

thanks it works on the sample dataframe. For my understanding what does `pat` do? – byc Jun 23 '20 at 17:00
pat is creating a pattern of strings by concatenating items in your list – Vaishali Jun 23 '20 at 17:05

search strings and label search terms in pandas efficiently

1 Answers1