Home > OS >  How should I sort and group data based on different variables
How should I sort and group data based on different variables

Time:12-16

I have data similar to the one below where shows are given a review score.

I have also hot-encoded it with values of either 0 or 1 (true or false) based on their listed genres

show review score Action & Adventure Anime Features Anime Series British TV Shows Children & Family Movies
a 8 1 0 0 0 0
b 10 0 1 0 0 0
c 9 0 1 0 0 0
d 6 0 0 1 0 0
e 9 0 0 0 1 0
f 7 0 0 0 0 1
g 8 0 0 0 0 1
h 8 1 0 0 0 0

I am trying to group by and sort them based on their values so that it looks something like the below,

With genre coming first, review score in descending order and also count of the number of shows with that review score

genre review score count
Action & Adventure 8 2
Anime Features 10 1
9 1
Anime Series 6 1
British TV Shows 9 1
Children & Family Movies 8 1
7 1

I have tried groupby but due to the amount of columns in play, it seems unfeasible.

CodePudding user response:

You could use melt with a groupby.

df1 = df.melt(id_vars=['show', 'review score']).groupby(['variable', 'review_score'])['value'].sum()

df1 is:

variable            review score    value
Action & Adventure          6       0
Action & Adventure          7       0
Action & Adventure          8       2
Action & Adventure          9       0
Action & Adventure          10      0
Anime Features              6       0
Anime Features              7       0
Anime Features              8       0
Anime Features              9       1
Anime Features              10      1
Anime Series                6       1
Anime Series                7       0
Anime Series                8       0
Anime Series                9       0
Anime Series                10      0
British TV Shows            6       0
British TV Shows            7       0
British TV Shows            8       0
British TV Shows            9       1
British TV Shows            10      0
Children & Family Movies    6       0
Children & Family Movies    7       1
Children & Family Movies    8       1
Children & Family Movies    9       0
Children & Family Movies    10      0

Then filter df1 for counts > 0

df1 = df1[df1 > 0]

df1 is:

variable                    review score    value
Action & Adventure          8               2
Anime Features              9               1
Anime Features              10              1
Anime Series                6               1
British TV Shows            9               1
Children & Family Movies    7               1
Children & Family Movies    8               1

CodePudding user response:

grouper = df.iloc[:, 2:].apply(lambda x: x.idxmax(), axis=1)
out = df.groupby([grouper, df['review score']])['review score'].count()

out

                          review score
Action & Adventure        8               2
Anime Features            9               1
                          10              1
Anime Series              6               1
British TV Shows          9               1
Children & Family Movies  7               1
                          8               1
Name: review score, dtype: int64



out is series. if you want dataframe, use following code

out.rename_axis(['genre', 'review score']).reset_index(name='count')

    genre                       review score    count
0   Action & Adventure          8               2 
1   Anime Features              9               1
2   Anime Features              10              1
3   Anime Series                6               1
4   British TV Shows            9               1
5   Children & Family Movies    7               1
6   Children & Family Movies    8               1
  • Related