filter on a pandas dataframe column which contains frozenset of strings

Question

I have a result DataFrame that I obtained like this (ref: http://rasbt.github.io/mlxtend/user_guide/frequent_patterns/apriori/):

import pandas as pd
from mlxtend.preprocessing import TransactionEncoder
from mlxtend.frequent_patterns import apriori

dataset = [['Milk', 'Onion', 'Nutmeg', 'Kidney Beans', 'Eggs', 'Yogurt'],
           ['Dill', 'Onion', 'Nutmeg', 'Kidney Beans', 'Eggs', 'Yogurt'],
           ['Milk', 'Apple', 'Kidney Beans', 'Eggs'],
           ['Milk', 'Unicorn', 'Corn', 'Kidney Beans', 'Yogurt'],
           ['Corn', 'Onion', 'Onion', 'Kidney Beans', 'Ice cream', 'Eggs']]

# One-hot encode the transactions into a boolean DataFrame
te = TransactionEncoder()
te_ary = te.fit(dataset).transform(dataset)
df = pd.DataFrame(te_ary, columns=te.columns_)

# Frequent itemsets; each value in 'itemsets' is a frozenset
file_result_df = apriori(df, min_support=0.6, use_colnames=True)
file_result_df['length'] = file_result_df['itemsets'].apply(len)

file_result_df

    support                     itemsets  length
0       0.8                       (Eggs)       1
1       1.0               (Kidney Beans)       1
2       0.6                       (Milk)       1
3       0.6                      (Onion)       1
4       0.6                     (Yogurt)       1
5       0.8         (Eggs, Kidney Beans)       2
6       0.6                (Onion, Eggs)       2
7       0.6         (Milk, Kidney Beans)       2
8       0.6        (Onion, Kidney Beans)       2
9       0.6       (Kidney Beans, Yogurt)       2
10      0.6  (Onion, Eggs, Kidney Beans)       3

The 'itemsets' column contains Python frozenset objects. I want to filter the results to show only the rows whose itemsets contain a selected string. For example, to show the rows containing 'eggs', the result would be:

    support                     itemsets  length
0       0.8                       (Eggs)       1
5       0.8         (Eggs, Kidney Beans)       2
6       0.6                (Onion, Eggs)       2
10      0.6  (Onion, Eggs, Kidney Beans)       3

I've tried what is suggested here: http://rasbt.github.io/mlxtend/user_guide/frequent_patterns/apriori/

This gives me an empty DataFrame:

fname = 'eggs'
file_result_df = file_result_df[ file_result_df['itemsets'] == frozenset((fname)) ]

This gives me only the first row:

file_result_df = file_result_df[ file_result_df['itemsets'] == {fname} ]

support  itemsets  length
    0.8    (Eggs)       1

And this raises an error:

fname = 'eggs'
file_result_df = file_result_df[file_result_df['itemsets'].str.lower().str.contains(fname)]

Error:

---------------------------------------------------------------------------
AttributeError                            Traceback (most recent call last)
<ipython-input-152-cb30c651c2b0> in <module>
      1 fname = 'eggs'
----> 2 result_df = result_df[result_df['itemsets'].str.lower().str.contains(fname)]

/opt/conda/lib/python3.6/site-packages/pandas/core/generic.py in __getattr__(self, name)
   5061         if (name in self._internal_names_set or name in self._metadata or
   5062                 name in self._accessors):
-> 5063             return object.__getattribute__(self, name)
   5064         else:
   5065             if self._info_axis._can_hold_identifiers_and_holds_name(name):

/opt/conda/lib/python3.6/site-packages/pandas/core/accessor.py in __get__(self, obj, cls)
    169             # we're accessing the attribute of the class, i.e., Dataset.geo
    170             return self._accessor
--> 171         accessor_obj = self._accessor(obj)
    172         # Replace the property with the accessor object. Inspired by:
    173         # http://www.pydanny.com/cached-property.html

/opt/conda/lib/python3.6/site-packages/pandas/core/strings.py in __init__(self, data)
   1794 
   1795     def __init__(self, data):
-> 1796         self._validate(data)
   1797         self._is_categorical = is_categorical_dtype(data)
   1798 

/opt/conda/lib/python3.6/site-packages/pandas/core/strings.py in _validate(data)
   1816             # (instead of test for object dtype), but that isn't practical for
   1817             # performance reasons until we have a str dtype (GH 9343)
-> 1818             raise AttributeError("Can only use .str accessor with string "
   1819                                  "values, which use np.object_ dtype in "
   1820                                  "pandas")

AttributeError: Can only use .str accessor with string values, which use np.object_ dtype in pandas

This seems to work

file_result_df = file_result_df[file_result_df['itemsets'].astype(str).str.contains(fname)]

But when I print the DataFrame, the frozensets have been converted to strings, which I don't want:

    support                                      itemsets  length
0       0.8                           frozenset({'Eggs'})       1
5       0.8           frozenset({'Eggs', 'Kidney Beans'})       2
6       0.6                  frozenset({'Onion', 'Eggs'})       2
10      0.6  frozenset({'Onion', 'Eggs', 'Kidney Beans'})       3

Any help is greatly appreciated. Thanks

Possible duplicate of [pandas + dataframe - select by partial string](https://stackoverflow.com/questions/11350770/pandas-dataframe-select-by-partial-string) — Teddy, Mar 07 '19 at 17:17

Frenchy · Answer 1 · 2019-03-08T09:24:07.703

2

A classic solution:

fname = 'eggs'
file_result_df = file_result_df[file_result_df['itemsets'].astype(str).str.lower().str.contains(fname)]

output:

    support                     itemsets  length
0       0.8                       (Eggs)     1.0
5       0.8         (Eggs, Kidney Beans)     2.0
6       0.6                (Onion, Eggs)     2.0
10      0.6  (Onion, Eggs, Kidney Beans)     3.0

edited Mar 08 '19 at 09:24

answered Mar 07 '19 at 17:28

Frenchy


This doesn't work. It gives me an error. I've updated my question with the result. Please take a look. – kdas Mar 07 '19 at 18:37
Your solution is the correct one, but because itemsets holds objects rather than strings, the str accessor threw an error; converting with astype(str) fixed the issue. Thank you for the help. – kdas Mar 07 '19 at 19:20

score 1 · Accepted Answer · answered Mar 07 '19 at 19:18

1

Aah! Found the issue. The .str accessor didn't work because the column holds frozenset objects rather than strings, as the error message says, so I had to convert it to str first using astype(str). Then it works:

file_result_df = file_result_df[file_result_df['itemsets'].astype(str).str.contains(fname)]

This filters out the items as I expected.
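
An alternative that avoids the string conversion is to test set membership directly on each frozenset. The following is a minimal sketch, not taken from the answers above: the small `result` DataFrame is a hand-built stand-in for the apriori output, and the case-insensitive comparison is an assumption about what matching 'eggs' against 'Eggs' should mean. Unlike substring matching, this compares whole items, so a case-insensitive search for 'corn' would not also match 'Unicorn'.

```python
import pandas as pd

# Hand-built stand-in for the apriori result (hypothetical subset of rows).
result = pd.DataFrame({
    "support": [0.8, 1.0, 0.8, 0.6],
    "itemsets": [
        frozenset({"Eggs"}),
        frozenset({"Kidney Beans"}),
        frozenset({"Eggs", "Kidney Beans"}),
        frozenset({"Onion", "Eggs"}),
    ],
})

fname = "eggs"

# True where any item in the frozenset equals fname, ignoring case.
mask = result["itemsets"].apply(
    lambda items: any(item.lower() == fname.lower() for item in items)
)
filtered = result[mask]

print(filtered.index.tolist())              # [0, 2, 3]
print(type(filtered["itemsets"].iloc[0]))   # <class 'frozenset'>
```

If the search term already has the exact casing of the items, the lambda can be reduced to `lambda items: fname in items`.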

answered Mar 07 '19 at 19:18

kdas


Can confirm this works (had the same problem with the same library). However, it seems to take quite a while, which makes sense since it transforms everything to string... – Thomas Oct 04 '19 at 13:21
Try `str.contains(fname, regex=False)`; this should be faster when `fname` is just a literal string. – Berger May 15 '20 at 07:16
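
The literal-match suggestion can be sketched as follows. This is only an illustration on a hand-built stand-in DataFrame (the `result` name and its rows are hypothetical). It also uses `case=False`, since `str.contains` is case-sensitive by default and 'eggs' would otherwise not match 'Eggs'. Note that `astype(str)` returns a new Series used only to build the mask, so the original column still holds frozensets.

```python
import pandas as pd

# Hand-built stand-in for the apriori result (hypothetical rows).
result = pd.DataFrame({
    "itemsets": [
        frozenset({"Eggs"}),
        frozenset({"Milk"}),
        frozenset({"Onion", "Eggs"}),
    ],
})

fname = "eggs"

# String copy of the column, used only for matching; result is not modified.
as_text = result["itemsets"].astype(str)

# regex=False treats fname as a literal substring; case=False ignores case.
mask = as_text.str.contains(fname, case=False, regex=False)
filtered = result[mask]

print(filtered.index.tolist())  # [0, 2]
```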
