Home > database >  How to filter the groupby dataframe in a multiindexed data
How to filter the groupby dataframe in a multiindexed data

Time:09-28

I have a data frame like this:

df:
ID  Group   Score1
1     A       45       
2     A       12
3     B       33
3     C       64
.     .       .
.     .       .
15000 .       .

There are thousands of IDs that are repeated and 15 groups in total. For every group, I want to display

  1. Number of unique IDs greater than or equal to Median Score
  2. Number of unique IDs greater than Score 40
  3. Number of unique IDs below the first quartile(0-25th percentile range) of the Score column
  4. Number of unique IDs above the third quartile(75 - 100th percentile range) of the Score column

So my output looks like this: Output:

Group  No of unique  No of unique  No of unique      No of unique
      IDs >= Median  IDs > 40       IDs below         IDs above the
                                  first quartile      third quartile
 A         56           25             17                 45
 B         33           78             28                 62
 C         23           36             38                 33
 .
 .
 P         39           28             26                 41

Can anyone help me with this? Many Thanks!

CodePudding user response:

Hope this works. First I created the four corresponding columns first in the original dataframe. Then create df1 which is the dataframe df with unique ID's. Finally, using pd.groupby() you can create the desired output:

df = pd.DataFrame({'ID': list(range(100)),
                   'Group': list('ABCDE')*20,
                   'Score1': np.random.randint(10, 99, 100)})

df['No of unique IDs >= Median'] = df.Score1 >= df.Score1.median()
df['No of unique IDs > 40'] = df.Score1 > 40
df['No of unique IDs < first quartile'] = df.Score1 < df.Score1.quantile(0.25)
df['No of unique IDs > third quartile'] = df.Score1 > df.Score1.quantile(0.75)

df1 = df.groupby(['ID', 'Group']).any().reset_index()

df1.groupby('Group').agg({'No of unique IDs >= Median': 'sum',
                          'No of unique IDs > 40': 'sum',
                          'No of unique IDs < first quartile': 'sum',
                          'No of unique IDs > third quartile': 'sum'})
  • Related