Let's assume that I have a dataset consisting the columns, movieId, title, year and ratings. Below is a sample subset of my data:
| movieId | title | rating | year |
|:-------:|:--------:|:------:|:----:|
| 1 | abc | 3.5 | 1995 |
| 1 | abc | 3 | 1995 |
| 1 | abc | 4 | 1995 |
| 1 | abc | 3 | 1995 |
| 1 | abc | 5 | 1995 |
| 1 | abc | 3.5 | 1995 |
| 1 | abc | 4.5 | 1995 |
| 1 | abc | 0.5 | 1995 |
| 1 | abc | 3.5 | 1995 |
| 1 | abc | 4.5 | 1995 |
| 1 | abc | 4 | 1995 |
| 1 | abc | 5 | 1995 |
| 1 | abc | 4.5 | 1995 |
| 1 | abc | 4 | 1995 |
| 1 | abc | 4 | 1995 |
| 1 | abc | 4 | 1995 |
| 1 | abc | 4 | 1995 |
| 1 | abc | 3 | 1995 |
| 1 | abc | 4 | 1995 |
| 1 | abc | 3.5 | 1995 |
| 1 | abc | 3 | 1995 |
| 1 | abc | 4 | 1995 |
| 1 | abc | 5 | 1995 |
| 1 | abc | 4.5 | 1995 |
| 1 | abc | 5 | 1995 |
| 2 | xyz | 3 | 2000 |
| 2 | xyz | 2 | 2000 |
| 2 | xyz | 3.5 | 2000 |
| 2 | xyz | 4 | 2000 |
| 2 | xyz | 3.5 | 2000 |
| 2 | xyz | 5 | 2000 |
| 2 | xyz | 3.5 | 2000 |
| 2 | xyz | 3 | 2000 |
| 2 | xyz | 3 | 2000 |
| 2 | xyz | 2 | 2000 |
| 2 | xyz | 3.5 | 2000 |
| 2 | xyz | 3 | 2000 |
| 2 | xyz | 3 | 2000 |
| 2 | xyz | 4 | 2000 |
| 2 | xyz | 2 | 2000 |
| 2 | xyz | 3.5 | 2000 |
| 2 | xyz | 1 | 2000 |
| 3 | pqr | 3 | 1997 |
| 3 | pqr | 2 | 1997 |
| 3 | pqr | 3.5 | 1997 |
| 3 | pqr | 3.5 | 1997 |
| 3 | pqr | 3 | 1997 |
| 3 | pqr | 3 | 1997 |
| 3 | pqr | 3 | 1997 |
| 3 | pqr | 3 | 1997 |
| 3 | pqr | 4.5 | 1997 |
| 3 | pqr | 3.5 | 1997 |
| 3 | pqr | 4 | 1997 |
| 3 | pqr | 1.5 | 1997 |
| 3 | pqr | 2 | 1997 |
| 3 | pqr | 2 | 1997 |
| 3 | pqr | 2.5 | 1997 |
| 4 | def | 3 | 1999 |
| 4 | def | 2.5 | 1999 |
| 4 | def | 2.5 | 1999 |
| 4 | def | 0.5 | 1999 |
| 4 | def | 2 | 1999 |
| 4 | def | 3 | 1999 |
| 5 | movie123 | 4 | 2006 |
| 5 | movie123 | 4 | 2006 |
| 5 | movie123 | 3 | 2006 |
| 5 | movie123 | 1.5 | 2006 |
| 5 | movie123 | 3 | 2006 |
| 5 | movie123 | 2 | 2006 |
| 5 | movie123 | 2.5 | 2006 |
| 5 | movie123 | 3 | 2006 |
| 5 | movie123 | 4 | 2006 |
| 5 | movie123 | 0.5 | 2006 |
| 5 | movie123 | 1 | 2006 |
| 5 | movie123 | 3.5 | 2006 |
| 5 | movie123 | 2 | 2006 |
| 5 | movie123 | 3 | 2006 |
| 5 | movie123 | 1.5 | 2006 |
| 5 | movie123 | 2.5 | 2006 |
| 5 | movie123 | 4 | 2006 |
| 5 | movie123 | 4 | 2006 |
| 5 | movie123 | 3.5 | 2006 |
| 5 | movie123 | 3 | 2006 |
| 6 | movie456 | 4 | 2012 |
| 6 | movie456 | 3.5 | 2012 |
| 6 | movie456 | 3.5 | 2012 |
| 6 | movie456 | 4 | 2012 |
| 6 | movie456 | 5 | 2012 |
| 6 | movie456 | 2.5 | 2012 |
| 6 | movie456 | 4 | 2012 |
| 6 | movie456 | 4 | 2012 |
| 6 | movie456 | 3.5 | 2012 |
| 6 | movie456 | 5 | 2012 |
| 6 | movie456 | 2 | 2012 |
| 6 | movie456 | 4 | 2012 |
I want to calculate the average, count, minimum rating, and mean ratings across the whole dataset by defining a function. So, I first calculate the number and average of ratings for each movie.
avg_rating = df.groupby(['movieId','year'])['ratings'].agg([('Count','size'), ('Mean','mean')]).sort_values(by='Mean',ascending=False)
Since, some movies may have less number of reviews but a higher rating Vs other movies with higher number of reviews and higher rating, the analysis might get skewed. So, I want to calculate weighted average, for which I have defined a function.
# R = average for the movie (mean) = (Rating)
# v = number of ratings/reviews for the movie = (votes)
# m = minimum reviews required to be listed in the Top 250 movie list
# C = the mean rating across the whole report
def weighted_rating(R, v, m, C):
return (v/(v m))*R (m/(v m))*C
avg_rating= avg_rating.assign(wr = weighted_rating(mean, count, 500, mean(mean)))
When I run the above last line of code, I get an error: name 'mean' is not defined
The same error persists for column count: name 'count' is not defined
How can I solve this error so that my final output contains the columns movieId, year, count, mean, and wr?
CodePudding user response:
You need to use the fields from your df
:
avg_rating = avg_rating.assign(
wr=weighted_rating(avg_rating['Mean'], avg_rating['Count'], 500, mean(avg_rating['Mean'])))
So the final code could be like:
#!/usr/bin/env python
# -*- coding: utf-8 -*-
from __future__ import (division, absolute_import, print_function,
unicode_literals)
import pandas as pd
from statistics import mean
def weighted_rating(R, v, m, C):
return (v / (v m)) * R (m / (v m)) * C
def main():
df = pd.read_csv('movies.csv')
avg_rating = df.groupby(['movieId', 'year'])['rating'].agg(
[('Count', 'size'), ('Mean', 'mean')]).sort_values(by='Mean', ascending=False)
avg_rating = avg_rating.assign(
wr=weighted_rating(avg_rating['Mean'], avg_rating['Count'], 500, mean(avg_rating['Mean'])))
print(avg_rating)
if __name__ == '__main__':
main()
Using the CSV:
movieId,title,rating,year
1,abc,3.5,1995
1,abc,3,1995
1,abc,4,1995
1,abc,3,1995
1,abc,5,1995
1,abc,3.5,1995
1,abc,4.5,1995
1,abc,0.5,1995
1,abc,3.5,1995
1,abc,4.5,1995
1,abc,4,1995
1,abc,5,1995
1,abc,4.5,1995
1,abc,4,1995
1,abc,4,1995
1,abc,4,1995
1,abc,4,1995
1,abc,3,1995
1,abc,4,1995
1,abc,3.5,1995
1,abc,3,1995
1,abc,4,1995
1,abc,5,1995
1,abc,4.5,1995
1,abc,5,1995
2,xyz,3,2000
2,xyz,2,2000
2,xyz,3.5,2000
2,xyz,4,2000
2,xyz,3.5,2000
2,xyz,5,2000
2,xyz,3.5,2000
2,xyz,3,2000
2,xyz,3,2000
2,xyz,2,2000
2,xyz,3.5,2000
2,xyz,3,2000
2,xyz,3,2000
2,xyz,4,2000
2,xyz,2,2000
2,xyz,3.5,2000
2,xyz,1,2000
3,pqr,3,1997
3,pqr,2,1997
3,pqr,3.5,1997
3,pqr,3.5,1997
3,pqr,3,1997
3,pqr,3,1997
3,pqr,3,1997
3,pqr,3,1997
3,pqr,4.5,1997
3,pqr,3.5,1997
3,pqr,4,1997
3,pqr,1.5,1997
3,pqr,2,1997
3,pqr,2,1997
3,pqr,2.5,1997
4,def,3,1999
4,def,2.5,1999
4,def,2.5,1999
4,def,0.5,1999
4,def,2,1999
4,def,3,1999
5,movie12,4,2006
5,movie12,4,2006
5,movie12,3,2006
5,movie12,1.5,2006
5,movie12,3,2006
5,movie12,2,2006
5,movie12,2.5,2006
5,movie12,3,2006
5,movie12,4,2006
5,movie12,0.5,2006
5,movie12,1,2006
5,movie12,3.5,2006
5,movie12,2,2006
5,movie12,3,2006
5,movie12,1.5,2006
5,movie12,2.5,2006
5,movie12,4,2006
5,movie12,4,2006
5,movie12,3.5,2006
5,movie12,3,2006
6,movie45,4,2012
6,movie45,3.5,2012
6,movie45,3.5,2012
6,movie45,4,2012
6,movie45,5,2012
6,movie45,2.5,2012
6,movie45,4,2012
6,movie45,4,2012
6,movie45,3.5,2012
6,movie45,5,2012
6,movie45,2,2012
6,movie45,4,2012