1

Suppose I have a dataframe such as follows,

data

    id  URL
    1   www.pandora.com
    2   m.jcpenney.com
    3   www.youtube.com
    4   www.facebook.com

I want to grep and find particular words in the URL and create a new column in it. Suppose I want to find youtube and facebook alone here, my ideal output would be,

id  URL                 host
1   www.pandora.com     None
2   m.jcpenney.com      None
3   www.youtube.com     youtube
4   www.facebook.com    facebook

The URLs are very complex in real data set and also the number of rows are extremely high(~4M). So I want to find 3-4 particular hosts alone and identify them by a new column.

Following is my try,

for i in data['URL']:
    re.search('youtube', i)

but I am getting,

TypeError: expected string or buffer

error here. I want to create a new column in the same dataframe where I would give condition for 3-4 hosts and remaining would be none in that column. Can anybody help me?

Thanks

haimen
  • 1,785
  • 6
  • 25
  • 44
  • Check out this [question](http://stackoverflow.com/questions/11350770/pandas-dataframe-select-by-partial-string) for examples of substring searches in Pandas DataFrames. – SNygard Apr 18 '16 at 17:30
  • @SNygard Thanks for that link. Now I am able to split each of them into separate dataframes. But I want to create a new column based on different conditions. Can you help me in doing that? – haimen Apr 18 '16 at 17:48

1 Answers1

1

We'll begin by instantiating the data frame and creating a new column host.

import pandas as pd
df = pd.DataFrame({'id': [1,2,3,4],
                   'URL': ['www.pandora.com', 'm.jcpenney.com', 'www.youtube.com', 'www.facebook.com']})
df['host'] = None

At this stage, the data frame looks like this:

id  URL                 host
1   www.pandora.com     None
2   m.jcpenney.com      None
3   www.youtube.com     None
4   www.facebook.com    None

We can create a for loop to search for substrings in the URL column. The following code writes the df['host'] column only in the positions where the URL column contains the search item.

for item in ['youtube', 'facebook']:
    df['host'][df['URL'].str.contains(item)] = item

Now the data frame looks like this:

id  URL                 host
1   www.pandora.com     None
2   m.jcpenney.com      None
3   www.youtube.com     youtube
4   www.facebook.com    facebook

Note that you may get the following warning:

C:\Anaconda3\envs\pyvizz\lib\site-packages\ipykernel\__main__.py:2: SettingWithCopyWarning: A value is trying to be set on a copy of a slice from a DataFrame

The warning just tells us we're overwriting the data in some position. In our case, we're overwriting the None in 'host' with youtube or facebook. So in our case the warning is safely ignored.

All of the searches you've mentioned are substring searches (check if youtube is a substring, etc.). However, more complicated regex searches can be done with the findall method.

import re
for item in ['youtube', 'facebook']:
    results = df['URL'].str.findall('(%s)' % item)
    df['host'][pd.Series(map(lambda x: False if len(x) == 0 else True, results))] = item

The findall applies the regular expression (youtube) etc to the data frame. Then we map items to False if len(x) == 0. That is, items are True if and only if the regular expression found a match. Using the True/False Series as a mask, we set all items that matched to the matched value: youtube.

SNygard
  • 462
  • 9
  • 15
  • "The warning just tells us we're overwriting the data in some position." – DSM Apr 18 '16 at 21:25
  • @DSM You're right. However, the docs include this example: `dfb['c'][dfb.a.str.startswith('o')] = 42` as a way to set values. Unfortunately, it throws the warning. If you have a better method (maybe in-place operations are a bad idea?), please edit my answer or add a new one. I could learn from that! – SNygard Apr 18 '16 at 22:19