I have a dataframe of writers like this:
writer_id | titles |
---|---|
n1 | t1,t2 |
n2 | t3,t4 |
And I have another dataframe for the titles with genre, ratings and votes like this:
title_id | genres | votes | rating |
---|---|---|---|
t1 | Drama,Action | 100 | 7.0 |
t2 | Action,Thriller | 1000 | 8.0 |
t3 | Crime ,Romance | 200 | 6.0 |
t4 | Drama,Romance | 300 | 5.0 |
Now in the new data frame, I want to have a row for each writer with columns for each genre with the count value and another column (let's call it popularity) that will apply a formula using votes and rating. So it would look like this:
writer_id | drama | action | thriller | romance | crime | popularity |
---|---|---|---|---|---|---|
n1 | 1 | 2 | 1 | 0 | 0 | 2.2 |
n2 | 0 | 1 | 0 | 2 | 1 | 4.2 |
How should I go about doing this? I have the columns with genres already created.
CodePudding user response:
First part is DataFrame.explode
by comma separated values in both DataFrames and merge
, for counts use crosstab
:
df11 = (df1.assign(title_id= df1['titles'].str.split(','))
.explode('title_id')
.drop('titles', 1))
df22 = (df2.assign(genres= df2['genres'].str.split(','))
.explode('genres'))
df = df11.merge(df22, on='title_id')
print (df)
writer_id title_id genres votes rating
0 n1 t1 Drama 100 7.0
1 n1 t1 Action 100 7.0
2 n1 t2 Action 1000 8.0
3 n1 t2 Thriller 1000 8.0
4 n2 t3 Crime 200 6.0
5 n2 t3 Romance 200 6.0
6 n2 t4 Drama 300 5.0
7 n2 t4 Romance 300 5.0
#count weight mean from df after `merge`
df4 = (df.assign(w = lambda x: x['votes'].mul(x['rating']))
.groupby('writer_id')
.sum()
.assign(w = lambda x: x['w'].div(x['votes'])))
df3 = pd.crosstab(df['writer_id'], df['genres']).assign(popularity = df4['w'])
print (df3)
genres Action Crime Drama Romance Thriller popularity
writer_id
n1 2 0 1 0 1 7.909091
n2 0 1 1 2 0 5.400000