2

I have a dataframe in which one column has text information.

print(df):

...   | ... |  Text                         |

...   | ... |  StringA. StringB. StringC    |
...   | ... |  StringZ. StringY. StringX    |
...   | ... |  StringL. StringK. StringJ    |
...   | ... |  StringA. StringZ. StringJ    |

I also have a dictionary that has the following:

dict = {'Dogs': ['StringA', 'StringL'],'Cats': ['StringB', 'StringZ', 'StringJ'],'Birds': ['StringK', 'StringY']}

EDIT: i have about 100 dictionary Keys which each have 4+ Values.

What I am hoping to do is create extra columns in the dataframe for each Key in the dictionary and then place a "1" in the column when any of the Values from the dictionary appear.

Therefore the output i am trying to get is:

print(df):

...   | ... |  Text                         |   Dogs   |   Cats    |   Birds

...   | ... |  StringA. StringB. StringC    |   1      |   1       |   0
...   | ... |  StringZ. StringY. StringX    |   0      |   1       |   1
...   | ... |  StringL. StringK. StringJ    |   1      |   1       |   1
...   | ... |  StringA. StringZ. StringJ    |   1      |   1       |   0

EDIT: The issue is I'm not sure how to search for the values within the text column and then return a 1 if found to the Key column. Any help would be much appreciated! Thanks!

SOK
  • 1,094
  • 2
  • 12
  • Can you be more specific about what the issue is? Please see [ask], [help/on-topic]. – AMC May 01 '20 at 07:22
  • The issue is how to search for the values within the text column and then return a 1 if found to the Key column – SOK May 01 '20 at 07:28
  • In pseudo code I would loop overt the rows of the `df`. For each row check how many of the keyword values occur in the column Text, then add the the count to the keyword column in the data-frame. I would look up in the Pandas documentation how to loop over rows and add a column value to a specific row. – Bruno Vermeulen May 01 '20 at 07:28
  • @SOK Then this question is a duplicate of https://stackoverflow.com/questions/11350770/select-by-partial-string-from-a-pandas-dataframe. – AMC May 01 '20 at 07:50

2 Answers2

1
import pandas as pd

d = {'Dogs': ['StringA', 'StringL'],'Cats': ['StringB', 'StringZ', 'StringJ'],'Birds': ['StringK', 'StringY']}
df = pd.DataFrame({'Text': ['StringA. StringB. StringC', 'StringZ. StringY. StringX', 'StringL. StringK. StringJ',
                            'StringA. StringZ. StringJ']})

for k,v in d.items(): # Key, value iteration of dict
    df[k] = df.apply(lambda x: 1 if any([s in x['Text'] for s in v]) else 0, axis=1)

# Apply lambda function to each row in the new column. If any of the values in the array is present in the text, its a 1

# Output
                        Text  Dogs  Cats  Birds
0  StringA. StringB. StringC     1     1      0
1  StringZ. StringY. StringX     0     1      1
2  StringL. StringK. StringJ     1     1      1
3  StringA. StringZ. StringJ     1     1      0

This solution may be unoptimal if the Strings are large or there are many strings. In which case you may have to add an additional column with some sort of Trie data structure.

But the above solution should work for most moderate cases.

Abhishek Jebaraj
  • 1,861
  • 2
  • 14
  • 20
  • Thanks very much @Abhishek. The code works when i test on smaller sample but then i get an `TypeError: argument of type 'float' is not iterable` when i run it on the full dataframe. Any ideas why that is? it works great when i just use a sample of the main dataframe – SOK May 01 '20 at 09:44
  • Welcome. Either some value in the 'Text' column is a float. Or some value in the dict is a float instead of list. You could do df['Text']=df['Text'].astype(str) just before the loop. And also run another loop to ensure that the type of all the values in the dict is list only. – Abhishek Jebaraj May 01 '20 at 12:14
  • After i rank the `df['Text']=df['Text'].astype(str)` it worked perfectly. Thanks very much!! – SOK May 01 '20 at 13:39
1

The answer of @Abhihek is the most efficient, but just to give another solution where you loop over df first

import numpy as np
import pandas as pd

d = {
    'Dogs': ['StringA', 'StringL'],
    'Cats': ['StringB', 'StringZ', 'StringJ'],
    'Birds': ['StringK', 'StringY']
}

df = pd.DataFrame({
    'Text': [
        'StringA. StringB. StringC',
        'StringZ. StringY. StringX',
        'StringL. StringK. StringJ',
        'StringA. StringZ. StringJ'
    ]
})

for index in df.index:

    for key, s_elements in d.items():
        df.at[index, key] = (lambda: 1 if any([s in df['Text'][index] for s in s_elements]) else 0)()

# set the type to short integers for the columns that have been added
for key in d:
    df = df.astype({key: np.uint8})

print(df.head())

                        Text  Dogs  Cats  Birds
0  StringA. StringB. StringC     1     1      0
1  StringZ. StringY. StringX     0     1      1
2  StringL. StringK. StringJ     1     1      1
3  StringA. StringZ. StringJ     1     1      0
Bruno Vermeulen
  • 1,862
  • 2
  • 9
  • 18