3

There is a process in Pandas DataFrame that I am trying to do for my capstone project on the Yelp Dataset Challenge. I have found a way to do it using loops, but given the large dataset I am working with, it takes a long time. (I tried running it 24 hours, and it still was not complete.) 

Is there a more efficient way to do this in Pandas without looping? 

Note: business.categories (business is a DataFrame) provides a list of categories a business is in stored as a string (e.g. "[restaurant, entertainment, bar, nightlife]"). It is written in the format of a list bust saved as a string. 

# Creates a new DataFrame with businesses as rows and columns as categories tags with 0 or 1 depending on whether the business is that category
categories_list = []

# Makes empty values an string of empty lists. This prevents Null errors later in the code.
business.categories = business.categories.fillna('[]')

# Creates all categories as a single list. Goes through each business's list of categories and adds any unique values to the master list, categories_list
for x in range(len(business)):
    # business.categories is storing each value as a list (even though it's formatted just like a string), so this converts it to a List
    categories = eval(str(business.categories[x]))
    # Looks at each categories, adding it to categories_list if it's not already there
    for category in categories:
        if category not in categories_list:
            categories_list.append(category)

# Makes the list of categories (and business_id) the colums of the new DataFrame
categories_df = pd.DataFrame(columns = ['business_id'] + categories_list, index = business.index)

# Loops through determining whether or not each business has each category, storing this as a 1 or 0 for that category type respectivity.
for x in range(len(business)):
    for y in range(len(categories_list)):
        cat = categories_list[y]
        if cat in eval(business.categories[x]):
            categories_df[cat][x] = 1
        else:
            categories_df[cat][x] = 0

# Imports the original business_id's into the new DataFrame. This allows me to cross-reference this DataFrame with my other datasets for analysis
categories_df.business_id = business.business_id

categories_df
Jan Trienes
  • 2,321
  • 1
  • 12
  • 24
Stephen
  • 31
  • 3

2 Answers2

0

Given that the data is stored as list-like strings, I don't think you can avoid looping over the data frame (either explicitly or implicitly, using str methods) in Python speeds (this seems like an unfortunate way of storing the data. can itbe avoided upstream?). However, I have some ideas for improving the approach. Since you know the resulting index ahead of time, you could immediately start building the DataFrame without knowing all the categories in advance, something like

categories_df = pd.DataFrame(index=business.index)
for ix, categories in business.categories.items():
    for cat in eval(categories):
        categories_df.loc[ix, cat] = 1   
        # if cat is not already in the columns this will add it in, with null values in the other rows
categories_df.fillna(0, inplace=True)

If you know some or all of the categories in advance then adding them as columns initially before the loop should help as well.

Also, you could try doing categories[1:-1].split(', ') instead of eval(categories). A quick test tells me it should be around 15 times faster. To ensure the same result, you should do

for ix, categories in business.categories.items():
    for cat in categories[1:-1].split(','):
        categories_df.loc[ix, cat.strip()] = 1  

to be on the safe side, as you won't know how much white space there might be around the commas. Avoiding much of the nested looping and in statements should speed your programme up considerably.

JoeCondron
  • 7,096
  • 1
  • 22
  • 27
0

Not exactly sure what you ultimately want to do is... But

Consider the dataframe business

business = pd.DataFrame(dict(
        categories=['[cat, dog]', '[bird, cat]', '[dog, bird]']
    ))

You can convert these strings to lists with

business.categories.str.strip('[]').str.split(', ')

Or even pd.get_dummies

business.categories.str.strip('[]').str.get_dummies(', ')

   bird  cat  dog
0     0    1    1
1     1    1    0
2     1    0    1
piRSquared
  • 240,659
  • 38
  • 359
  • 510
  • @Stephen go for it. Also, don't forget to accept the answer. [**What should I do when someone answers my question?**](https://stackoverflow.com/help/someone-answers) – piRSquared Jun 12 '17 at 21:31
  • It works, but there's an issue. For the column labels, it saves the string literals including the '' or "". For example, the category Accessories gets saved as the literal "'Accessories'" with apostrophes as the first and last character of the string literal. Strings naturally with an apostrophe or quotes become unreadable. For example, the original category "Men's clothing" gets saved as '"Men's Clothing"' (the literal string value includes the quotations at the beginning and end). This means that when you try to call the method like categories['"Men's Clothing"' ], there is an error. – Stephen Jun 13 '17 at 02:48