I have a data frame like this:
df:
ID Group Score1
1 A 45
2 A 12
3 B 33
3 C 64
. . .
. . .
15000 . .
There are thousands of IDs that are repeated and 15 groups in total. For every group, I want to display
- Number of unique IDs greater than or equal to Median Score
- Number of unique IDs greater than Score 40
- Number of unique IDs below the first quartile(0-25th percentile range) of the Score column
- Number of unique IDs above the third quartile(75 - 100th percentile range) of the Score column
So my output looks like this: Output:
Group No of unique No of unique No of unique No of unique
IDs >= Median IDs > 40 IDs below IDs above the
first quartile third quartile
A 56 25 17 45
B 33 78 28 62
C 23 36 38 33
.
.
P 39 28 26 41
Can anyone help me with this? Many Thanks!
CodePudding user response:
Hope this works. First I created the four corresponding columns first in the original dataframe. Then create df1
which is the dataframe df
with unique ID's. Finally, using pd.groupby()
you can create the desired output:
df = pd.DataFrame({'ID': list(range(100)),
'Group': list('ABCDE')*20,
'Score1': np.random.randint(10, 99, 100)})
df['No of unique IDs >= Median'] = df.Score1 >= df.Score1.median()
df['No of unique IDs > 40'] = df.Score1 > 40
df['No of unique IDs < first quartile'] = df.Score1 < df.Score1.quantile(0.25)
df['No of unique IDs > third quartile'] = df.Score1 > df.Score1.quantile(0.75)
df1 = df.groupby(['ID', 'Group']).any().reset_index()
df1.groupby('Group').agg({'No of unique IDs >= Median': 'sum',
'No of unique IDs > 40': 'sum',
'No of unique IDs < first quartile': 'sum',
'No of unique IDs > third quartile': 'sum'})