I have about 6 million rows and 500 search terms. For each search term I want to find out which of the 6 million rows partially matches it (a substring match). A single search term can also match multiple rows.
My data looks like this
import pandas as pd
searchterms = ['apple', 'orange', 'pear']
tosearch = {'ID':[1,2,3,4,5,6],'Item':['aassseapplekk', 'orangee', 'peear', 'banana', 'appleorange', 'fruitpear']}
df = pd.DataFrame(tosearch)
print(df)
I want to know which search term produces each match, and return something like this
ideal = {'ID':[1,2,5,5,6],'Item':['aassseapplekk', 'orangee', 'appleorange', 'appleorange', 'fruitpear'], 'searchterms':['apple', 'orange', 'apple', 'orange', 'pear']}
ideal_df = pd.DataFrame(ideal)
print(ideal_df)
I've tried the following, but I'm not sure how it scales up to millions of records, and I don't know how to find out which search term produced each result.
Trial 1: use a regex alternation (pipe) to search for all terms at once
import re
# escape the terms so any regex metacharacters are treated literally
pattern = '|'.join(map(re.escape, searchterms))
mock = df[df['Item'].str.contains(pattern)]
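Trial 1 tells you *whether* a row matched, but not *which* term matched. One way to recover the term is to wrap the alternation in a capture group and use `str.extractall`, which yields one row per match and so also handles rows like 'appleorange' that match several terms. A minimal sketch on the sample data (the names `pattern`, `matches`, and `result` are my own):

```python
import re
import pandas as pd

searchterms = ['apple', 'orange', 'pear']
df = pd.DataFrame({'ID': [1, 2, 3, 4, 5, 6],
                   'Item': ['aassseapplekk', 'orangee', 'peear',
                            'banana', 'appleorange', 'fruitpear']})

# Capture group around the escaped alternation so extractall records the term.
pattern = '(' + '|'.join(re.escape(t) for t in searchterms) + ')'

# extractall returns a MultiIndex (original row, match number);
# drop the match level to join back onto the original rows.
matches = df['Item'].str.extractall(pattern)[0].rename('searchterms')
result = df.join(matches.droplevel('match')).dropna(subset=['searchterms'])
print(result)
```

This is a single pass over the column, so it avoids re-scanning 6 million rows once per term; rows with no match (like 'banana') are dropped by the `dropna`.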
Trial 2: loop over the terms and append the search results for each one
I'm still unsure about this loop: I can't seem to substitute each item of my search list for 'i', and then append the results to the empty list afterwards.
match = []
for i in searchterms:
    # pass the loop variable i itself, not the literal string 'i'
    searchterm_match = df[df['Item'].str.contains(i, regex=False)]
    match.append(searchterm_match)
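With the quoting bug fixed, Trial 2 can also answer the "which term matched" question by tagging each partial result before collecting it. A sketch under the same sample data (the names `frames` and `result`, and the final sort, are my own choices):

```python
import pandas as pd

searchterms = ['apple', 'orange', 'pear']
df = pd.DataFrame({'ID': [1, 2, 3, 4, 5, 6],
                   'Item': ['aassseapplekk', 'orangee', 'peear',
                            'banana', 'appleorange', 'fruitpear']})

frames = []
for term in searchterms:
    # regex=False does a plain substring search on the loop variable
    hits = df[df['Item'].str.contains(term, regex=False)].copy()
    hits['searchterms'] = term  # record which term produced these rows
    frames.append(hits)

# one concat at the end is cheaper than growing a DataFrame inside the loop
result = pd.concat(frames, ignore_index=True).sort_values(['ID', 'searchterms'])
print(result)
```

Note this still scans the column once per term (500 passes over 6 million rows), so for the full dataset the single-pass regex approach, or a dedicated multi-pattern matcher (Aho-Corasick style), is likely to be faster.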