How to get the highest value per category in a dataframe?-CodePudding

I have a dataframe called movie_df that has more than 3000 values of title, score, and rating. Titles are unique. Scores are 0.0 - 10.0. Ratings are either PG-13, G, R, or X. They are sorted by their rating, then ascending score.

I want to find the highest rated title per rating. The highest rated title doesn't have an equal rating with another title.

title	score	rating
avengers	5.4	PG-13
captain america	6.7	PG-13
iron man	8.6	PG-13
...	...	...
spiderman	7	R
daredevil	8.2	R
deadpool	10	R

Expected output: PG-13 : Iron Man, R : Deadpool

I don't want to use a loop to find the highest rated title.

I tried:

movie_df.sort_values(by=['rating', 'score'], inplace=True) # sort by rating, score
print(movie_df.to_string()) # to show dataframe
movie_df.groupby('rating').max()

It shows me the correct highest score, but the title is wrong. It shows me the max title too, but I don't want that. I want to know the title associated with the highest score.

Here is the actual data I'm using with its highest rated titles: Rated G Actual, NC-17 Actual, PG Actual, PG-13 Actual, R Actual

And the output: Rated G output, PG Output, PG-13 Output, R Output

CodePudding user response：

Let's try:

movie_df.reset_index(drop=True, inplace=True)

m=max(movie_df['score'])

print(movie_df['rating'][list(movie_df['score']).index(m)])

I think you can also use groupby() and agg()

CodePudding user response：

I think your data isn't actually sorted right, that's why you're getting the wrong title but the right score.

Try movie_df.groupby('rating').idxmax() and check if you're getting the right index.

CodePudding user response：

I got it. This code outputs the highest scored title per rating.

movie_df["rank"] = movie_df.groupby("rating")["score"].rank("dense", ascending=False)
movie_df[movie_df["rank"]==1.0][['title','score']]

It uses the rank function from pandas. https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.rank.html

CodePudding user response：

Actually what you want is highest score per rating, you can group each rating by highest score in this way:

data = [['avengers', 5.4 ,'PG-13'],
['captain america', 6.7, 'PG-13'],
['spiderman', 7,    'R'],
['daredevil', 8.2, 'R'],
['iron man', 8.6, 'PG-13'],
['deadpool', 10, 'R']]
 
df = pd.DataFrame(data, columns=['title', 'score', 'rating']) 
# Method 1 using lambda function
df = df.groupby('rating').apply(lambda x: x.sort_values('score', ascending = False).head(1))
print(df.reset_index(drop=True))

# Method 2 
df = df.sort_values('score', ascending = False).groupby('rating').head(1)
print(df.reset_index(drop=True))

Output1:

      title  score rating
0  iron man    8.6  PG-13
1  deadpool   10.0      R

Output2:

      title  score rating
0  deadpool   10.0      R
1  iron man    8.6  PG-13