Rearrange a pandas data frame to create a 2d ratings matrix

Question

I'm trying to build an item-based recommendation system from the Yelp dataset. I have processed the data to the point where I have the ratings given by every user who reviewed a restaurant in a given state. Eventually I want a ratings matrix with restaurants on one axis, users on the other, and ratings (1-5) in the cells (zero for missing reviews).

Right now the DF looks like this:

               user_id               review_id             business_id  stars
0  Xqd0DzHaiyRqVH3WRG7  15SdjuK7DmYqUAj6rjGowg  vcNAWiLM4dR7D2nwwJ7nCA      5
1  Xqd0DzHaiyRqVH3WRG7  15SdjuK7DmYqUAj6rjGowg  vcNAWiLM4dR7D2nwwJ7nCA      5
2  H1kH6QZV7Le4zqTRNxo  RF6UnRTtG7tWMcrO2GEoAg  vcNAWiLM4dR7D2nwwJ7nCA      2
3  zvJCcrpm2yOZrxKffwG  -TsVN230RCkLYKBeLsuz7A  vcNAWiLM4dR7D2nwwJ7nCA      4
4  KBLW4wJA_fwoWmMhiHR  dNocEAyUucjT371NNND41Q  vcNAWiLM4dR7D2nwwJ7nCA      4
5  zvJCcrpm2yOZrxKffwG  ebcN2aqmNUuYNoyvQErgnA  vcNAWiLM4dR7D2nwwJ7nCA      4
6  Qrs3EICADUKNFoUq2iH  _ePLBPrkrf4bhyiKWEn4Qg  vcNAWiLM4dR7D2nwwJ7nCA      1

but I would like it to look a little bit more like this:

(4 Restaurants x 5 Users)

It would be better if you included a copy-pastable example here. You need something like a pivot, but the Yelp dataset is really sparse, so you might have memory problems. This structure may be more suitable. – ayhan, Jun 01 '16 at 18:53
Like instead of the link to the image? I'm not sure how but I can try – mmera, Jun 01 '16 at 18:56
I think the best option is a sample with dummy data like `df = pd.DataFrame({'A':['a','b','c','c'], 'B':['g','h','f', 'p'], 'C':[7,8,9,1]})`; modify it as needed. Also, don't forget to add the desired output. – jezrael, Jun 01 '16 at 19:04
Yes, just copy and paste the first few rows of the dataframe and format them as code using the {} button. I edited the post with similar values. – ayhan, Jun 01 '16 at 19:09
Thanks ayhan! That looks great and I think it's much more helpful to other users. – mmera, Jun 01 '16 at 19:19
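
To illustrate the memory concern raised in the comments above: one possible way to avoid a dense pivot is to build a SciPy sparse matrix directly from the long-format frame. This is only a sketch under assumptions. The toy data is made up, and averaging repeated (business, user) pairs is one choice among several, made here because `csr_matrix` sums duplicate coordinates.

```python
import pandas as pd
from scipy.sparse import csr_matrix

# Hypothetical toy data: user "u1" reviewed business "b1" twice.
df = pd.DataFrame({
    "user_id":     ["u1", "u1", "u2", "u3"],
    "business_id": ["b1", "b1", "b1", "b2"],
    "stars":       [3, 5, 2, 4],
})

# Collapse repeated (business, user) pairs first; otherwise csr_matrix
# would add their ratings together.
dedup = df.groupby(["business_id", "user_id"], as_index=False)["stars"].mean()

# Categorical codes give each business/user a compact integer position.
biz = dedup["business_id"].astype("category")
usr = dedup["user_id"].astype("category")

matrix = csr_matrix(
    (dedup["stars"].to_numpy(),
     (biz.cat.codes.to_numpy(), usr.cat.codes.to_numpy())),
    shape=(len(biz.cat.categories), len(usr.cat.categories)),
)

print(matrix.shape)  # rows are businesses, columns are users
```

Only the observed ratings are stored; missing reviews are implicit zeros. `biz.cat.categories` and `usr.cat.categories` map row and column positions back to the original IDs.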

score 4 · Accepted Answer · edited May 23 '17 at 12:00

I think you need `pivot` with `fillna`:

print(df.pivot(index='business_id', columns='user_id', values='stars').fillna(0))

If that raises:

ValueError: Index contains duplicate entries, cannot reshape

Then use pivot_table:

print(df.pivot_table(index='business_id', columns='user_id', values='stars').fillna(0))
user_id                 H1kH6QZV7Le4zqTRNxo  KBLW4wJA_fwoWmMhiHR  \
business_id                                                        
vcNAWiLM4dR7D2nwwJ7nCA                    2                    4   

user_id                 Qrs3EICADUKNFoUq2iH  Xqd0DzHaiyRqVH3WRG7  \
business_id                                                        
vcNAWiLM4dR7D2nwwJ7nCA                    1                    5   

user_id                 zvJCcrpm2yOZrxKffwG  
business_id                                  
vcNAWiLM4dR7D2nwwJ7nCA                    4

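Note that `pivot_table` aggregates duplicate entries with `aggfunc`, which defaults to the mean (`aggfunc=np.mean`), so a user who reviewed the same business more than once ends up with the average of those ratings. A better explanation with a sample is here and in the docs.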
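
A minimal sketch of the difference, using made-up IDs (`u1`, `b1`, and so on are hypothetical, not from the Yelp data): `pivot` fails on a repeated (business, user) pair, while `pivot_table` averages it.

```python
import pandas as pd

# Hypothetical toy data: user "u1" reviewed business "b1" twice (3 and 5 stars).
df = pd.DataFrame({
    "user_id":     ["u1", "u1", "u2", "u3"],
    "business_id": ["b1", "b1", "b1", "b2"],
    "stars":       [3, 5, 2, 4],
})

# pivot refuses to reshape when an (index, column) pair occurs more than once.
try:
    df.pivot(index="business_id", columns="user_id", values="stars")
    pivot_failed = False
except ValueError:
    pivot_failed = True

# pivot_table aggregates the duplicates instead (mean, stated explicitly here).
ratings = df.pivot_table(index="business_id", columns="user_id",
                         values="stars", aggfunc="mean").fillna(0)
print(ratings)
```

Here the two ratings from `u1` for `b1` are averaged to 4.0. If the mean is not what you want, a different `aggfunc` (for example `"max"` or `"last"`) can be passed instead.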

answered Jun 01 '16 at 18:57

jezrael

I tried this and the kernel crashed. The dataset was definitely too big. I'm going to add head() to the end and see if this works. – mmera Jun 01 '16 at 19:06
Or maybe try for testing `df_small = df.head(100)` – jezrael Jun 01 '16 at 19:08
So I think this is what I need but it's not quite there yet...It's not the first solution because that gets me the reviews as the y axis. Instead I want the businesses as the y axis. In theory the second answer should work but now I'm getting an error that says: `ValueError: Index contains duplicate entries, cannot reshape` – mmera Jun 01 '16 at 19:17
I think the problem is that there are multiple users reviewing the same business so it can't use the business as an index? But how do we fix it so there is only one business to multiple users? – mmera Jun 01 '16 at 19:21
Pivot table seems to have worked fine despite the `aggfunc=np.mean` disclaimer, although the result is a little sparse. I know this is a different question, but how do we remove users who have reviewed fewer than, say, 20 restaurants? – mmera Jun 01 '16 at 19:48
I think you can use [filter](http://pandas.pydata.org/pandas-docs/stable/groupby.html#filtration). – jezrael Jun 01 '16 at 19:50
I converted the table to a numpy array using `values`. Then I simply used `count_nonzero` and removed the rows with fewer than 20 reviews. – mmera Jun 03 '16 at 18:23
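
For the follow-up about dropping low-activity users, here is a rough sketch of doing the filtering in pandas before pivoting, rather than on the numpy array afterwards. The data and the `min_reviews` threshold are hypothetical (3 here so the toy frame shows an effect; the comments above mention 20).

```python
import pandas as pd

# Hypothetical toy data: only "u1" has reviewed three distinct businesses.
df = pd.DataFrame({
    "user_id":     ["u1", "u1", "u1", "u2", "u3", "u3"],
    "business_id": ["b1", "b2", "b3", "b1", "b2", "b3"],
    "stars":       [5, 3, 4, 2, 1, 4],
})

min_reviews = 3  # assumed threshold for this example

# Count distinct businesses per user and broadcast the count to every row.
n_reviewed = df.groupby("user_id")["business_id"].transform("nunique")
active = df[n_reviewed >= min_reviews]

# Pivot only the remaining users, which also keeps the matrix smaller.
ratings = active.pivot_table(index="business_id", columns="user_id",
                             values="stars").fillna(0)
print(ratings)
```

Filtering first means the pivot never has to allocate columns for users who would be dropped anyway.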
