8

I have a list:

things = ['A1','B2','C3']

I have a pandas data frame with a column containing values separated by a semicolon - some of the rows will contain matches with one of the items in the list above (it won't be a perfect match since it has other parts of a string in the column.. for example, a row in that column may have 'Wow;Here;This=A1;10001;0')

I want to save the rows that contain a match with items from the list, and then create a new data frame with those selected rows (should have the same headers). This is what I tried:

import re

for_new_df =[]

for x in df['COLUMN']:
    for mp in things:
        if df[df['COLUMN'].str.contains(mp)]:
            for_new_df.append(mp)  #This won't save the whole row - help here too, please.

This code gave me an error:

ValueError: The truth value of a DataFrame is ambiguous. Use a.empty, a.bool(), a.item(), a.any() or a.all().

I'm very new to coding, so the more explanation and detail in your answer, the better! Thanks in advance.

Eric Coy
  • 95
  • 1
  • 6
  • For partial string search using multiple keywords, I recommend taking a look at [this answer.](https://stackoverflow.com/a/55335207) – cs95 Apr 07 '19 at 21:03

2 Answers2

11

You can avoid the loop by joining your list of words to create a regex and use str.contains:

pat = '|'.join(thing)
for_new_df = df[df['COLUMN'].str.contains(pat)]

should just work

So the regex pattern becomes: 'A1|B2|C3' and this will match anywhere in your strings that contain any of these strings

Example:

In [65]:
things = ['A1','B2','C3']
pat = '|'.join(things)
df = pd.DataFrame({'a':['Wow;Here;This=A1;10001;0', 'B2', 'asdasda', 'asdas']})
df[df['a'].str.contains(pat)]

Out[65]:
                          a
0  Wow;Here;This=A1;10001;0
1                        B2

As to why it failed:

if df[df['COLUMN'].str.contains(mp)]

this line:

df[df['COLUMN'].str.contains(mp)]

returns a df masked by the boolean array of your inner str.contains, if doesn't understand how to evaluate an array of booleans hence the error. If you think about it what should it do if you 1 True or all but one True? it expects a scalar and not an array like value.

EdChum
  • 294,303
  • 173
  • 671
  • 486
  • Sorry @EdChum - Whilst working on my solution I didn't realise I had basically duplicated yours – emmalg Jul 12 '16 at 16:47
  • @emmalg no worries, up to you what you want to do – EdChum Jul 12 '16 at 16:56
  • @EdChum How would I check to see which items from the list (things = ['A1|B1|C1']) were not found at all in column 'A'? – Eric Coy Jul 12 '16 at 19:48
  • You should post another question really – EdChum Jul 12 '16 at 20:18
  • @Yuva sorry but asking questions in comments is poor form on SO, also I don't understand what you're asking. It's best to post a new question, include your raw data, code to reproduce your df, your desired result and all attempts. – EdChum May 23 '19 at 13:39
  • @EdChum, yeah sure i understand, am doing now. Will delete this comment, as desired. Thanks – Yuva May 23 '19 at 13:43
  • I like this answer, but I don't think one could use it for updates, e.g.`df[df['COLUMN'].str.contains(mp)] = np.nan`, if someone wanted to do this. – Josiah Yoder Oct 12 '20 at 14:07
2

Pandas is actually amazing but I don't find it very easy to use. However it does have many functions designed to make life easy, including tools for searching through huge data frames.

Though it may not be a full solution to your problem, this may help set you off on the right foot. I have assumed that you know which column you are searching in, column A in my example.

import pandas as pd

df = pd.DataFrame({'A' : pd.Categorical(['Wow;Here;This=A1;10001;0', 'Another;C3;Row=Great;100', 'This;D6;Row=bad100']),
                   'B' : 'foo'})
print df #Original data frame
print
print df['A'].str.contains('A1|B2|C3')  # Boolean array showing matches for col A
print
print df[df['A'].str.contains('A1|B2|C3')]   # Matching rows

The output:

                          A    B
0  Wow;Here;This=A1;10001;0  foo
1  Another;C3;Row=Great;100  foo
2        This;D6;Row=bad100  foo

0     True
1     True
2    False
Name: A, dtype: bool

                          A    B
0  Wow;Here;This=A1;10001;0  foo
1  Another;C3;Row=Great;100  foo
emmalg
  • 336
  • 1
  • 12