Python: Write a nested loop to test whether a series of string values is present in the column of a dataframe

Question

I have two dataframes df1 and df2. df1 has a column called 'comments' that contains a string. df2 has a column called 'labels' that contains smaller strings. I am trying to write a function that searches df1['comments'] for the strings contained in df2['labels'] and creates a new variable for d1 called df1['match'] that is True if df1['comments'] contains any of the strings in df2['labels'] and False if df1['comments'] does not contain any of the strings in df2['labels'].

I'm trying to use df.str.contains('word', na=False) to solve this problem and I have managed to create the column df1['match'] searching for one specific string using the following function:

df1['match'] = df1['comment'].str.contains('mystring', na=False)

However, I struggle to write a function that iterates over all the words in df2['label'] and creates a df1['match'] with True if any of the words in df2['label'] are present and False otherwise.

This is my attempt at writing the loop:

for comment in df1['comment']:

   for word in df2['label']:
       if df1['comment'].str.contains(word, na=False)=True:
          df1['match']=True
                             #(would need something to continue to next comment if there is a match)
          else:
              df1['match']=False   #(put value as false if there none of the items in df2['label' is contained in df1['comment']``

Any help would be greately appreciated.

kait · Answer 1 · 2020-08-12T18:51:42.360

3

You can do a multiple substring search through a regex search using pipe. See this post

df1['match'] = df['comment'].str.contains('|'.join(df2['label'].values), na=False)

edited Aug 12 '20 at 18:51

answered Aug 12 '20 at 15:23

kait

1,149
7
12

Thank you, this helps! – rtzu Aug 12 '20 at 15:33

score 0 · Answer 2 · edited Aug 12 '20 at 18:53

0

I don't know how much it will help but better way to compare is below way. It's efficient.

If df1['match'] you want to mention row by row then code will need some change . But I think you got what actually you wanted.

test1=df2['label'].to_list()
test2=df1['comments'].to_list()
flag = 0
if(set(test1).issubset(set(test2))): 
    flag = 1

if (flag) : 
    df1['match']=True
else : 
    df1['match']=False

edited Aug 12 '20 at 18:53

Dan

43,452
14
75
140

answered Aug 12 '20 at 15:02

Vivs

276
9

Thanks for your answer, I tried this but it returns False for all the values in df1['match']. Don't I still have to write some kind of loop as I want df1['match']=1 if the observation contains any of the values in df2['label']? – rtzu Aug 12 '20 at 15:21

score 0 · Answer 3 · answered Aug 12 '20 at 17:51

0

Try this If this helps

df2['match'] = "False"

for idx, word in enumerate(df2['labels']):
    q = df1['comment'][idx:].str.contains(word)
    df2['match'][idx] = q[idx]

answered Aug 12 '20 at 17:51

Vedant Desai

41
4

I have posted my full code let me know if this is what you want, Thanks – Vedant Desai Aug 12 '20 at 19:29

score 0 · Answer 4 · answered Aug 12 '20 at 19:28

Here is the complete code let me know if this is what you are asking for

import pandas as pd

d = {'comment': ["abcd efgh ijk", "lmno pqrst uvwxyz", "123456789 4567895062"]}
df1 = pd.DataFrame(data=d)
print(df1)

d = {'labels': ["efgh", "pqrst", "12389"]}
df2 = pd.DataFrame(data=d)
print(df2)

df2['match'] = "False"


for idx, word in enumerate(df2['labels']):
    q = df1['comment'][idx:].str.contains(word)
    df2['match'][idx] = q[idx]

print("final df2")
print(df2)

Python: Write a nested loop to test whether a series of string values is present in the column of a dataframe

4 Answers4