I have data similar to the one below where shows are given a review score.
I have also hot-encoded it with values of either 0 or 1 (true or false) based on their listed genres
show | review score | Action & Adventure | Anime Features | Anime Series | British TV Shows | Children & Family Movies |
---|---|---|---|---|---|---|
a | 8 | 1 | 0 | 0 | 0 | 0 |
b | 10 | 0 | 1 | 0 | 0 | 0 |
c | 9 | 0 | 1 | 0 | 0 | 0 |
d | 6 | 0 | 0 | 1 | 0 | 0 |
e | 9 | 0 | 0 | 0 | 1 | 0 |
f | 7 | 0 | 0 | 0 | 0 | 1 |
g | 8 | 0 | 0 | 0 | 0 | 1 |
h | 8 | 1 | 0 | 0 | 0 | 0 |
I am trying to group by and sort them based on their values so that it looks something like the below,
With genre coming first, review score in descending order and also count of the number of shows with that review score
genre | review score | count |
---|---|---|
Action & Adventure | 8 | 2 |
Anime Features | 10 | 1 |
9 | 1 | |
Anime Series | 6 | 1 |
British TV Shows | 9 | 1 |
Children & Family Movies | 8 | 1 |
7 | 1 |
I have tried groupby but due to the amount of columns in play, it seems unfeasible.
CodePudding user response:
You could use melt with a groupby.
df1 = df.melt(id_vars=['show', 'review score']).groupby(['variable', 'review_score'])['value'].sum()
df1 is:
variable review score value
Action & Adventure 6 0
Action & Adventure 7 0
Action & Adventure 8 2
Action & Adventure 9 0
Action & Adventure 10 0
Anime Features 6 0
Anime Features 7 0
Anime Features 8 0
Anime Features 9 1
Anime Features 10 1
Anime Series 6 1
Anime Series 7 0
Anime Series 8 0
Anime Series 9 0
Anime Series 10 0
British TV Shows 6 0
British TV Shows 7 0
British TV Shows 8 0
British TV Shows 9 1
British TV Shows 10 0
Children & Family Movies 6 0
Children & Family Movies 7 1
Children & Family Movies 8 1
Children & Family Movies 9 0
Children & Family Movies 10 0
Then filter df1 for counts > 0
df1 = df1[df1 > 0]
df1 is:
variable review score value
Action & Adventure 8 2
Anime Features 9 1
Anime Features 10 1
Anime Series 6 1
British TV Shows 9 1
Children & Family Movies 7 1
Children & Family Movies 8 1
CodePudding user response:
grouper = df.iloc[:, 2:].apply(lambda x: x.idxmax(), axis=1)
out = df.groupby([grouper, df['review score']])['review score'].count()
out
review score
Action & Adventure 8 2
Anime Features 9 1
10 1
Anime Series 6 1
British TV Shows 9 1
Children & Family Movies 7 1
8 1
Name: review score, dtype: int64
out
is series. if you want dataframe, use following code
out.rename_axis(['genre', 'review score']).reset_index(name='count')
genre review score count
0 Action & Adventure 8 2
1 Anime Features 9 1
2 Anime Features 10 1
3 Anime Series 6 1
4 British TV Shows 9 1
5 Children & Family Movies 7 1
6 Children & Family Movies 8 1