I have a dataframe called movie_df that has more than 3000 values of title, score, and rating. Titles are unique. Scores are 0.0 - 10.0. Ratings are either PG-13, G, R, or X. They are sorted by their rating, then ascending score.
I want to find the highest rated title per rating. The highest rated title doesn't have an equal rating with another title.
title | score | rating |
---|---|---|
avengers | 5.4 | PG-13 |
captain america | 6.7 | PG-13 |
iron man | 8.6 | PG-13 |
... | ... | ... |
spiderman | 7 | R |
daredevil | 8.2 | R |
deadpool | 10 | R |
Expected output: PG-13 : Iron Man, R : Deadpool
I don't want to use a loop to find the highest rated title.
I tried:
movie_df.sort_values(by=['rating', 'score'], inplace=True) # sort by rating, score
print(movie_df.to_string()) # to show dataframe
movie_df.groupby('rating').max()
It shows me the correct highest score, but the title is wrong. It shows me the max title too, but I don't want that. I want to know the title associated with the highest score.
Here is the actual data I'm using with its highest rated titles: Rated G Actual, NC-17 Actual, PG Actual, PG-13 Actual, R Actual
And the output: Rated G output, PG Output, PG-13 Output, R Output
CodePudding user response:
Let's try:
movie_df.reset_index(drop=True, inplace=True)
m=max(movie_df['score'])
print(movie_df['rating'][list(movie_df['score']).index(m)])
I think you can also use groupby()
and agg()
CodePudding user response:
I think your data isn't actually sorted right, that's why you're getting the wrong title but the right score.
Try movie_df.groupby('rating').idxmax()
and check if you're getting the right index.
CodePudding user response:
I got it. This code outputs the highest scored title per rating.
movie_df["rank"] = movie_df.groupby("rating")["score"].rank("dense", ascending=False)
movie_df[movie_df["rank"]==1.0][['title','score']]
It uses the rank function from pandas. https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.rank.html
CodePudding user response:
Actually what you want is highest score per rating, you can group each rating by highest score in this way:
data = [['avengers', 5.4 ,'PG-13'],
['captain america', 6.7, 'PG-13'],
['spiderman', 7, 'R'],
['daredevil', 8.2, 'R'],
['iron man', 8.6, 'PG-13'],
['deadpool', 10, 'R']]
df = pd.DataFrame(data, columns=['title', 'score', 'rating'])
# Method 1 using lambda function
df = df.groupby('rating').apply(lambda x: x.sort_values('score', ascending = False).head(1))
print(df.reset_index(drop=True))
# Method 2
df = df.sort_values('score', ascending = False).groupby('rating').head(1)
print(df.reset_index(drop=True))
Output1:
title score rating
0 iron man 8.6 PG-13
1 deadpool 10.0 R
Output2:
title score rating
0 deadpool 10.0 R
1 iron man 8.6 PG-13