1

I have two dataframes df1 and df2. df1 has a column called 'comments' that contains a string. df2 has a column called 'labels' that contains smaller strings. I am trying to write a function that searches df1['comments'] for the strings contained in df2['labels'] and creates a new variable for d1 called df1['match'] that is True if df1['comments'] contains any of the strings in df2['labels'] and False if df1['comments'] does not contain any of the strings in df2['labels'].

I'm trying to use df.str.contains('word', na=False) to solve this problem and I have managed to create the column df1['match'] searching for one specific string using the following function:

df1['match'] = df1['comment'].str.contains('mystring', na=False)

However, I struggle to write a function that iterates over all the words in df2['label'] and creates a df1['match'] with True if any of the words in df2['label'] are present and False otherwise.

This is my attempt at writing the loop:

for comment in df1['comment']:

   for word in df2['label']:
       if df1['comment'].str.contains(word, na=False)=True:
          df1['match']=True
                             #(would need something to continue to next comment if there is a match)
          else:
              df1['match']=False   #(put value as false if there none of the items in df2['label' is contained in df1['comment']``

Any help would be greately appreciated.

rtzu
  • 49
  • 5

4 Answers4

3

You can do a multiple substring search through a regex search using pipe. See this post

df1['match'] = df['comment'].str.contains('|'.join(df2['label'].values), na=False)
kait
  • 1,149
  • 7
  • 12
0

I don't know how much it will help but better way to compare is below way. It's efficient.

If df1['match'] you want to mention row by row then code will need some change . But I think you got what actually you wanted.

test1=df2['label'].to_list()
test2=df1['comments'].to_list()
flag = 0
if(set(test1).issubset(set(test2))): 
    flag = 1

if (flag) : 
    df1['match']=True
else : 
    df1['match']=False 
Dan
  • 43,452
  • 14
  • 75
  • 140
Vivs
  • 276
  • 9
  • Thanks for your answer, I tried this but it returns False for all the values in df1['match']. Don't I still have to write some kind of loop as I want df1['match']=1 if the observation contains any of the values in df2['label']? – rtzu Aug 12 '20 at 15:21
0

Try this If this helps

df2['match'] = "False"

for idx, word in enumerate(df2['labels']):
    q = df1['comment'][idx:].str.contains(word)
    df2['match'][idx] = q[idx]
0

Here is the complete code let me know if this is what you are asking for

import pandas as pd

d = {'comment': ["abcd efgh ijk", "lmno pqrst uvwxyz", "123456789 4567895062"]}
df1 = pd.DataFrame(data=d)
print(df1)

d = {'labels': ["efgh", "pqrst", "12389"]}
df2 = pd.DataFrame(data=d)
print(df2)

df2['match'] = "False"


for idx, word in enumerate(df2['labels']):
    q = df1['comment'][idx:].str.contains(word)
    df2['match'][idx] = q[idx]

print("final df2")
print(df2)