How to filter a grouped DataFrame by maximum values in each group using Pandas?

Time:12-15

I hope you are doing well in the current situation

I have the following DataFrame as input:

import pandas as pd

df_0 = pd.DataFrame({"year" : [1960, 1960, 1960, 1960, 1961, 1961, 1961, 1962, 1962, 1962],
                     "genre": ['Action', 'Crime', 'Action', 'Drama', 'Thriller', 'Thriller', 'Crime', 'Drama', 'Drama', 'Thriller'],
                     "popularity": [1.99, 0.53, 1.81, 0.23, 3.86, 3.94, 0.21, 4.30, 5.60, 0.09] })

figure 0:

        year    genre   popularity
0       1960    Action    1.99
1       1960    Crime     0.53
2       1960    Action    1.81
3       1960    Drama     0.23
4       1961    Thriller  3.86
5       1961    Thriller  3.94
6       1961    Crime     0.21
7       1962    Drama     4.30
8       1962    Drama     5.60
9       1962    Thriller  0.09

I created a new DataFrame df_1 by grouping on year and genre and aggregating popularity like this:

df_1 = df_0.groupby(['year','genre']).popularity.agg(['mean','max'])

figure 1:

                    mean    max
year    genre       
1960    Action      1.90    1.99
        Crime       0.53    0.53
        Drama       0.23    0.23
1961    Crime       0.21    0.21
        Thriller    3.90    3.94
1962    Drama       4.95    5.60
        Thriller    0.09    0.09

The result is similar to the following DataFrame:

df_1 = pd.DataFrame({"year" : [1960, 1960, 1960, 1961, 1961, 1962, 1962],
                     "genre": ['Action', 'Crime', 'Drama', 'Crime', 'Thriller', 'Drama', 'Thriller'],
                     "mean" : [1.90, 0.53, 0.23, 0.21, 3.90, 4.95, 0.09],
                     "max"  : [1.99, 0.53, 0.23, 0.21, 3.94, 5.60, 0.09] }).set_index("year")

I'm struggling with the next step. I would like to create the following DataFrame df_2 from df_1 (the .groupby() result) using only pandas functions (with no NumPy, or as little as possible):

df_2 = pd.DataFrame({"year" : [1960, 1961, 1962],
                     "genre": ['Action', 'Thriller', 'Drama'],
                     "mean" : [1.90, 3.90, 4.95],
                     "max"  : [1.99, 3.94, 5.60] }).set_index("year")

figure 2:

        genre     mean  max
year            
1960    Action    1.90  1.99
1961    Thriller  3.90  3.94
1962    Drama     4.95  5.60

For each year, df_2 keeps only the row of the genre with the highest max value.

Any tips?
Thank you for your support.

Stay safe

CodePudding user response:

You could try the following:

import pandas as pd

# query the rows you want from df_1, then reset the index to turn
# year and genre into columns
# note: the year and genre values are hard-coded, so only these rows are returned
df_2 = df_1.query('year in [1960, 1961] and genre in ["Action", "Thriller"]').reset_index()

The result will look like this:

   year     genre  mean   max
0  1960    Action   1.9  1.99
1  1961  Thriller   3.9  3.94
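
If you would rather not hard-code the years and genres, here is a minimal sketch of one possible alternative (not part of the answer above). It assumes df_1 comes straight from the groupby in the question, so its index is a (year, genre) MultiIndex. The name best_rows is only illustrative. It uses idxmax to find, for each year, the index label of the row with the largest max, then selects those rows with .loc:

```python
import pandas as pd

df_0 = pd.DataFrame({
    "year": [1960, 1960, 1960, 1960, 1961, 1961, 1961, 1962, 1962, 1962],
    "genre": ["Action", "Crime", "Action", "Drama", "Thriller",
              "Thriller", "Crime", "Drama", "Drama", "Thriller"],
    "popularity": [1.99, 0.53, 1.81, 0.23, 3.86, 3.94, 0.21, 4.30, 5.60, 0.09],
})

# Same aggregation as in the question: the index is a (year, genre) MultiIndex.
df_1 = df_0.groupby(["year", "genre"])["popularity"].agg(["mean", "max"])

# For each year, get the (year, genre) label of the row with the largest "max".
# best_rows is an illustrative name; idxmax keeps the first label on ties.
best_rows = df_1.groupby(level="year")["max"].idxmax()

# Select those rows, then move genre out of the index so only year remains.
df_2 = df_1.loc[best_rows.tolist()].reset_index(level="genre")

print(df_2)
```

With the sample data this should keep one row per year (Action for 1960, Thriller for 1961, Drama for 1962), which matches figure 2. If two genres tie for the highest max in a year, only the first one is kept, so adjust if you need all tied rows.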