Creating new pandas dataframe from partial string match

Question

I have a relatively simple dataframe that looks like this (see below). One of the columns, "Book", is a list of strings.

My goal is to make new dataframes for each of the three distinct values in "Book". That is to say, a dataframe with every product that appears in International, every product that appears in Domestic, and Subscription.

I don't know how to make a new dataframe that is built from matching partial strings in an existing dataframe. Is there a built in functionality for this, or should I build a loop that iterates over the dataframe, and build a new one by that?

df

    Description      Book                               Product ID
0   Products      International, Domestic                 X11
1   Products      International                           X12
2   Products      Domestic                                X13
3   Products      Domestic, International                 X21
4   Services      Subscription, Domestic                  X23
5   Services      International, Domestic                 X23
6   Services      Subscription, International, Domestic   X25

I have tried using different combinations of Pandas isin functionality, but that requires you knowing the exact string you are looking for. In my case, Book column can have any order of the three values, and I have been unable to use isin successfully because of that.

An example of a loop I was trying is:

f = []
for index,row in df.iterrows():
    if "International" in row['Book']:
        f.append

However this creates an empty list and I know that is in't right. I am not so strong building loops over dataframes, and any advice is greatly appreciated.

My goal output would be dataframes that look like this:

df

    Description      Book                               Product ID
0   Products      International                           X11
1   Products      International                           X12
2   Products      International                           X21
3   Services      International                           X23
4   Services      International                           X25

And

df

    Description   Book                               Product ID
0   Products      Domestic                                X11
2   Products      Domestic                                X13
3   Products      Domestic                                X21
4   Services      Domestic                                X23
5   Services      Domestic                                X25

And same for Subscription. I have looked at multiple other SO questions and been unable to find one that would help in this situation.

@Wen I'm unfamiliar with this method. Within this context, would that mean encoding each unique value in the Book column, then using those encodings to extract new dataframes? Thank you for your comment — Warthog1, Feb 27 '18 at 18:27

score 1 · Answer 1 · answered Feb 27 '18 at 18:26

I am not sure if the code you tried actually had a chance of working. Have you tried the following:

f = []
for index,row in df.iterrows():
    if "International" in row['Book']:
        f.append(row)

Please notice f.append(row) at the end.

This is probably not the most optimal way.

I would try something of the sorts below which would give you 3 columns which are much more suitable for grouping (by df.groupby) which will give you lists of products in each of the categories.

df['International'] = df.apply(lambda r: 'International' in r['Book'])
df['Domestic'] = df.apply(lambda r: 'Domestic' in r['Book'])
df['Subscription'] = df.apply(lambda r: 'Subscription' in r['Book'])

score 1 · Answer 2 · answered Feb 27 '18 at 18:27

Ass I said in comment using get_dummies

s=df.Book.str.get_dummies(sep=',')
[df[s[x]==1].assign(Book=x) for x in s.columns]
Out[198]: 
[  Description      Book ProductID
 0    Products  Domestic       X11
 2    Products  Domestic       X13
 3    Products  Domestic       X21
 4    Services  Domestic       X23
 5    Services  Domestic       X23
 6    Services  Domestic       X25,   Description           Book ProductID
 0    Products  International       X11
 1    Products  International       X12
 3    Products  International       X21
 5    Services  International       X23
 6    Services  International       X25,   Description          Book ProductID
 4    Services  Subscription       X23
 6    Services  Subscription       X25]

score 1 · Accepted Answer · answered Feb 27 '18 at 18:32

Another way:

International:

df_international = df[df['Book'].str.contains('International')].reset_index(drop=True)
df_international.loc[:, 'Book'] = 'International'
print(df_international)
#      Description           Book Product ID
#0        Products  International        X11
#1        Products  International        X12
#2        Products  International        X21
#3        Services  International        X23
#4        Services  International        X25

Domestic:

df_domestic = df[df['Book'].str.contains('Domestic')].reset_index(drop=True)
df_domestic.loc[:, 'Book'] = 'Domestic'
print(df_domestic)
#      Description      Book Product ID
#0        Products  Domestic        X11
#1        Products  Domestic        X13
#2        Products  Domestic        X21
#3        Services  Domestic        X23
#4        Services  Domestic        X23
#5        Services  Domestic        X25

Subscription:

df_subscription = df[df['Book'].str.contains('Subscription')].reset_index(drop=True)
df_subscription.loc[:, 'Book'] = 'Subscription'
print(df_subscription)
#      Description          Book Product ID
#0        Services  Subscription        X23
#1        Services  Subscription        X25

Creating new pandas dataframe from partial string match

3 Answers3