Find exact longest match from list in a dataframe column

Question

I have a big list of strings where some elements are substrings of other elements. For example:

list = ['car', 'cartesian', ...]

I want to check if any element of my list is present in a column of a pandas dataframe, but I want to get a match that corresponds to a whole string and not to a substring. For example, if a row in the column has this value:

'carmine is a pigment'

I don't want 'car' to match inside 'carmine'.

Right now I'm applying the solution here:

Pandas str.contains - Search for multiple values in a string and print the values in a new column

but I'm getting partial matches.

Thanks!

The linked solution returns the first match, so you simply need to sort `list` in decreasing order of string length before passing it to the solution: `list.sort(key=len, reverse=True)` or something similar. — Hristo 'away' Iliev, Apr 06 '20 at 14:04
Thanks, that worked! But I realised I have another problem: the method I use also matches substrings, and I need whole-word matches only. Any idea for that? — Iria, Apr 07 '20 at 12:08
If you are using the regex method `.str.contains('a|b|c')`, you can add word boundaries: `.str.contains(r'\b(a|b|c)\b')`. — Hristo 'away' Iliev, Apr 07 '20 at 12:31
I'm not using `str.contains`. I could use it, but the problem is that instead of spelling out the strings I'm looking for (`.str.contains('a|b|c')`), I'm joining my list: `output = df[df['source'].str.contains('|'.join(my_list), regex=True)]`, so I don't know how to indicate word boundaries there... — Iria, Apr 07 '20 at 13:03
`'|'.join(my_list)` produces exactly `a|b|c`. You only need to prepend `r'\b('` and append `r')\b'`. — Hristo 'away' Iliev, Apr 07 '20 at 13:15
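
Putting the suggestions from the comments together, a minimal sketch might look like the following. The sample rows, the extra term 'cartesian product', and the new column name `match` are made up for illustration; only the column name `source` and the variable `my_list` come from the comments. Using `re.escape` and `str.extract` is an addition not mentioned above: the first guards against regex metacharacters in the terms, the second returns the matched text instead of a boolean.

```python
import re
import pandas as pd

# Hypothetical sample data; only the names 'source' and 'my_list' come from the thread.
my_list = ['car', 'cartesian', 'cartesian product']
df = pd.DataFrame({'source': [
    'carmine is a pigment',
    'a cartesian product of two sets',
    'my car is red',
]})

# Longest terms first, so the alternation tries the longest candidate
# before its shorter prefixes at any given position.
terms = sorted(my_list, key=len, reverse=True)

# Wrap the alternation in word boundaries; escape each term in case it
# contains characters that are special in a regex.
pattern = r'\b(' + '|'.join(re.escape(t) for t in terms) + r')\b'

# expand=False yields a Series with the first whole-word match per row,
# or NaN when no term matches.
df['match'] = df['source'].str.extract(pattern, expand=False)

# Keep only the rows where some term matched as a whole word.
output = df[df['match'].notna()]
print(output)
```

Note that this picks the longest term at the leftmost position where any term matches, which is not necessarily the longest term anywhere in the row. If a row can contain several terms and the overall longest one is needed, something like `str.findall` followed by `max(..., key=len)` would probably be required instead.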
