Home > database >  How to convert df with item ratings to df that contains number of users that rated a pair of items?
How to convert df with item ratings to df that contains number of users that rated a pair of items?

Time:01-04

I'm trying to create a df that contains the number of users that rated rated both movies from a pair for every pair of movies.

My original df

index userID MovieID rating
0 1 0 4
1 2 1 1
2 2 0 3
3 3 2 2
4 3 1 2
5 1 1 2
6 2 2 3

What i want to achieve

movieID / movieID 0 1 2
0 nan 2 1
1 2 nan 2
2 1 2 nan

Currently I'm computing this df iteratively; for each unique combination of movieID's, and passing the ids to this function

def foo(id1, id2):
  id1_users = set(df[df["movieID"] == id1]["userID"].to_list())
  id2_users = set(df[df["movieID"] == id2]["userID"].to_list())
  combined = len(id1_users & id2_users)
  return combined

Is there a faster way to compute this?

CodePudding user response:

Here's an alternative way to do it. Using itertools.combinations, we can find pairs of MovieIDs and find the set of users who rated each pair to obtain values dictionary.

Then reformat this dictionary to get out dictionary which we cast to a DataFrame:

from itertools import combinations
users = df.groupby('MovieID')['userID'].apply(list).to_dict()
values = {(mov1, mov2): len(set(users[mov1]).intersection(users[mov2])) for mov1, mov2 in combinations(set(df['MovieID']), 2)}
out = {}
for (i,c), v in values.items():
    out.setdefault(c, {})[i] = v
    out.setdefault(i, {})[c] = v
df = pd.DataFrame(out)[[0,1,2]].sort_index()

Output:

     0    1    2
0  NaN  2.0  1.0
1  2.0  NaN  2.0
2  1.0  2.0  NaN

Note that this outcome is different from yours but it appears your expected outcome has a mistake because for MovieIDs 1 and 2, userIDs 2 and 3 both rate them, so the value of the corresponding cell should be 2 not 0.

CodePudding user response:

If you want to compute your table without loops, you should first generate a pivot_table with any to identify the users that voted at least once for a movie. Then use a dot product to count the cross correlations, with eventually a numpy.fill_diagonal to hide the self-correlations.

d = df.pivot_table(index='userID',
                   columns='MovieID',
                   values='rating',
                   aggfunc=any, 
                   fill_value=False)
out = d.T.dot(d)

# optional, to remove self correlations (in place)
import numpy as np
np.fill_diagonal(out.values, np.nan)

Output:

MovieID    0    1    2
MovieID               
0        NaN    2    1
1          2  NaN    2
2          1    2  NaN

Intermediate pivot table:

MovieID  0  1  2
userID          
1        1  1  0
2        1  1  1
3        0  1  1
  •  Tags:  
  • Related