I have a big list of strings where some elements are substrings of other elements. For example:

list = ['car', 'cartesian', ...]

I want to check if any element of my list is present in a column of a pandas dataframe, but I want to get a match that corresponds to a whole string and not to a substring. For example, if a row in the column has this value:

'carmine is a pigment'

I don't want a match here: 'car' should not match inside 'carmine'.

Right now I'm applying the solution here:

Pandas str.contains - Search for multiple values in a string and print the values in a new column

but I'm getting partial matches.

Thanks!

Iria
  • The linked solution returns the first match, so you simply need to sort `list` in decreasing order of string length before passing it to the solution: `list.sort(key=len, reverse=True)` or something similar. – Hristo 'away' Iliev Apr 06 '20 at 14:04
  • Thanks, that worked! But I realised I had another problem: the method I use gives me also matches of substrings, and I need matches with the whole word. Any idea for that? – Iria Apr 07 '20 at 12:08
  • If you are using the regex method `.str.contains('a|b|c')`, you can add word boundaries: `.str.contains(r'\b(a|b|c)\b')`. – Hristo 'away' Iliev Apr 07 '20 at 12:31
  • I'm not using `str.contains`. I could use it, but the problem there is that instead of specifying the strings I'm looking for (`.str.contains('a|b|c')`) I'm pointing to the list: `output = df[df['source'].str.contains('|'.join(my_list), regex=True)]`, so I don't know how to indicate word boundaries there... – Iria Apr 07 '20 at 13:03
  • `'|'.join(my_list)` produces exactly `a|b|c`. You only need to prepend `r'\b('` and append `r')\b'`. – Hristo 'away' Iliev Apr 07 '20 at 13:15
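
Putting the comments above together, a minimal sketch might look like the following. The sample dataframe, the `match` column name and the use of `re.escape` are assumptions added for illustration; only `my_list`, `df['source']` and the `'|'.join(...)` pattern come from the thread.

```python
import re
import pandas as pd

# Hypothetical sample data; the real list and dataframe are much larger.
my_list = ['car', 'cartesian']
df = pd.DataFrame({'source': ['carmine is a pigment',
                              'my car is red',
                              'a cartesian plane']})

# Longest terms first, as suggested in the first comment, and escaped
# (an assumption) in case a term contains regex metacharacters.
terms = sorted(my_list, key=len, reverse=True)

# Wrap the joined alternatives in word boundaries. The group is
# non-capturing so that str.contains does not warn about match groups.
pattern = r'\b(?:' + '|'.join(re.escape(t) for t in terms) + r')\b'

# Keep only the rows that contain one of the terms as a whole word.
mask = df['source'].str.contains(pattern, regex=True)
output = df[mask]

# Optionally record which term matched; str.extract needs a capturing group.
df['match'] = df['source'].str.extract('(' + pattern + ')', expand=False)
```

With this pattern, 'carmine is a pigment' is not selected, because there is no word boundary between 'car' and 'mine'. Note that `\b` only marks a boundary between a word character and a non-word character, so terms that begin or end with punctuation may need a different kind of delimiter.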
