I'm trying to create a df that contains the number of users that rated rated both movies from a pair for every pair of movies.
My original df
index | userID | MovieID | rating |
---|---|---|---|
0 | 1 | 0 | 4 |
1 | 2 | 1 | 1 |
2 | 2 | 0 | 3 |
3 | 3 | 2 | 2 |
4 | 3 | 1 | 2 |
5 | 1 | 1 | 2 |
6 | 2 | 2 | 3 |
What i want to achieve
movieID / movieID | 0 | 1 | 2 |
---|---|---|---|
0 | nan | 2 | 1 |
1 | 2 | nan | 2 |
2 | 1 | 2 | nan |
Currently I'm computing this df iteratively; for each unique combination of movieID's, and passing the ids to this function
def foo(id1, id2):
id1_users = set(df[df["movieID"] == id1]["userID"].to_list())
id2_users = set(df[df["movieID"] == id2]["userID"].to_list())
combined = len(id1_users & id2_users)
return combined
Is there a faster way to compute this?
CodePudding user response:
Here's an alternative way to do it. Using itertools.combinations
, we can find pairs of MovieID
s and find the set of users who rated each pair to obtain values
dictionary.
Then reformat this dictionary to get out
dictionary which we cast to a DataFrame:
from itertools import combinations
users = df.groupby('MovieID')['userID'].apply(list).to_dict()
values = {(mov1, mov2): len(set(users[mov1]).intersection(users[mov2])) for mov1, mov2 in combinations(set(df['MovieID']), 2)}
out = {}
for (i,c), v in values.items():
out.setdefault(c, {})[i] = v
out.setdefault(i, {})[c] = v
df = pd.DataFrame(out)[[0,1,2]].sort_index()
Output:
0 1 2
0 NaN 2.0 1.0
1 2.0 NaN 2.0
2 1.0 2.0 NaN
Note that this outcome is different from yours but it appears your expected outcome has a mistake because for MovieID
s 1
and 2
, userID
s 2
and 3
both rate them, so the value of the corresponding cell should be 2
not 0
.
CodePudding user response:
If you want to compute your table without loops, you should first generate a pivot_table
with any
to identify the users that voted at least once for a movie. Then use a dot
product to count the cross correlations, with eventually a numpy.fill_diagonal
to hide the self-correlations.
d = df.pivot_table(index='userID',
columns='MovieID',
values='rating',
aggfunc=any,
fill_value=False)
out = d.T.dot(d)
# optional, to remove self correlations (in place)
import numpy as np
np.fill_diagonal(out.values, np.nan)
Output:
MovieID 0 1 2
MovieID
0 NaN 2 1
1 2 NaN 2
2 1 2 NaN
Intermediate pivot table:
MovieID 0 1 2
userID
1 1 1 0
2 1 1 1
3 0 1 1