1

I have about 6millions rows and 500 search terms. For each search term I want to find out which of these 6million rows partially matches with it. There are multiple matches for each search term too.

My data looks like this

import pandas as pd
searchterms = ['apple', 'orange', 'pear']
tosearch = {'ID':[1,2,3,4,5,6],'Item':['aassseapplekk', 'orangee', 'peear', 'banana', 'appleorange', 'fruitpear']}
df = pd.DataFrame(tosearch)
print(df)

I want to know which search terms returns the matches, and return something like this

ideal =  {'ID':[1,2,5,5,6],'Item':['aassseapplekk', 'orangee', 'appleorange', 'appleorange', 'fruitpear'], 'searchterms':['apple', 'orange', 'apple', 'orange', 'pear']}
ideal_df = pd.DataFrame(ideal)
print(ideal_df)

I've tried the following - not sure how it scales up to millions records and any ideas how I can find out which search item returns the result?

Trial 1: use regex pipe to search all search items

pattern = '|'.join(searchterms)
mock = df[df['Item'].str.contains(pattern)]

Trial 2: loop to append search results of each term

I'm still unsure about this loop since I don't seem to be able to substitute each item in my search list in 'i' and append to the empty list afterwards.

match = []
for i in searchterms:
    searchterm_match = df[df['Item'].str.contains(r'i', regex=False)]
    match.append(searchterm_match)
Wiktor Stribiżew
  • 484,719
  • 26
  • 302
  • 397
byc
  • 73
  • 8
  • This [Answer](https://stackoverflow.com/questions/11350770/select-by-partial-string-from-a-pandas-dataframe) might be useful. – Akash senta Jun 23 '20 at 16:52

1 Answers1

1

You can use extractall to extract the pattern and then merge the result back to the df

pat = '('+'|'.join(searchterms)+')'
df.merge(df.groupby('ID')['Item'].apply(lambda x: x.str.extractall((pat))).reset_index([1,2], drop = True)[0].reset_index(name = 'searchterms'))


    ID  Item            searchterms
0   1   aassseapplekk   apple
1   2   orangee         orange
2   5   appleorange     apple
3   5   appleorange     orange
4   6   fruitpear       pear
Vaishali
  • 32,439
  • 4
  • 39
  • 71