I am trying to build a table of category flags from a Pandas DataFrame for my capstone project on the Yelp Dataset Challenge. I have found a way to do it using loops, but given the large dataset I am working with, it takes far too long. (I let it run for 24 hours, and it still had not finished.)
Is there a more efficient way to do this in Pandas without looping?
Note: business.categories (business is a DataFrame) holds the list of categories each business belongs to, stored as a string (e.g. "[restaurant, entertainment, bar, nightlife]"). It is written in the format of a list but saved as a string.
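To make the format concrete, here is a small sketch on made-up data (the toy `business` frame and its values are hypothetical, and it assumes each category is quoted inside the string, which the abbreviated example above does not show). It only illustrates that a cell is a `str` and that `ast.literal_eval` from the standard library can turn it into a list without running arbitrary code the way `eval` would.

```python
import ast

import pandas as pd

# Hypothetical toy data standing in for the real Yelp business table
business = pd.DataFrame({
    "business_id": ["b1", "b2", "b3"],
    "categories": ["['Restaurants', 'Bars']", "['Nightlife', 'Bars']", None],
})

# Each non-null cell is a plain string that merely looks like a list
first_cell = business.categories[0]
print(type(first_cell).__name__)

# literal_eval only parses Python literals, so it is safer than eval here
parsed = ast.literal_eval(first_cell)
print(parsed)
```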
import pandas as pd

# Creates a new DataFrame with businesses as rows and category tags as columns, holding 1 or 0 depending on whether the business is in that category
categories_list = []
# Makes empty values a string representing an empty list. This prevents null errors later in the code.
business.categories = business.categories.fillna('[]')
# Builds a single master list of all categories. Goes through each business's list of categories and adds any value not yet seen to categories_list
for x in range(len(business)):
    # business.categories stores each value as a string (even though it's formatted just like a list), so this converts it to a list
    categories = eval(str(business.categories[x]))
    # Looks at each category, adding it to categories_list if it's not already there
    for category in categories:
        if category not in categories_list:
            categories_list.append(category)
# Makes the list of categories (and business_id) the columns of the new DataFrame
categories_df = pd.DataFrame(columns=['business_id'] + categories_list, index=business.index)
# Loops through every business and every category, storing 1 if the business has that category and 0 otherwise.
for x in range(len(business)):
    for y in range(len(categories_list)):
        cat = categories_list[y]
        if cat in eval(business.categories[x]):
            # .loc writes to categories_df directly; chained indexing (categories_df[cat][x] = ...) may assign to a copy
            categories_df.loc[x, cat] = 1
        else:
            categories_df.loc[x, cat] = 0
# Imports the original business_id's into the new DataFrame. This allows me to cross-reference this DataFrame with my other datasets for analysis
categories_df.business_id = business.business_id
categories_df
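For comparison, a loop-free version of the steps above might look roughly like the sketch below. It is untested on the real dataset and rests on assumptions: the toy `business` frame is hypothetical, every category string is assumed to be a valid Python list literal, and `explode` plus `get_dummies` is just one possible approach, not necessarily the fastest.

```python
import ast

import pandas as pd

# Hypothetical toy data standing in for the real Yelp business table
business = pd.DataFrame({
    "business_id": ["b1", "b2", "b3"],
    "categories": ["['Restaurants', 'Bars']", "['Nightlife', 'Bars']", None],
})

# Parse every string exactly once instead of once per (business, category) pair
parsed = business["categories"].fillna("[]").apply(ast.literal_eval)

# One row per (business, category); an empty list becomes a single NaN row
exploded = parsed.explode()

# One-hot encode, then collapse back to one row per business.
# get_dummies ignores NaN by default, so businesses without categories get all zeros.
dummies = pd.get_dummies(exploded, dtype=int).groupby(level=0).max()

# Attach business_id so the result can be cross-referenced with other tables
categories_df = pd.concat([business[["business_id"]], dummies], axis=1)
print(categories_df)
```

Unlike the loop version, the category columns here come out in sorted order rather than first-seen order, and the values are integers rather than objects.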