I can remove data above the 95th percentile from a column using:
df[df.value < df.value.quantile(.95)]
How can I remove data above the 95th percentile within each group defined by another column?
For example, given the dataframe below, I want to remove row 1 because it is above the 95th percentile within type A.
Row type value
1 A 100000
2 A 0.1
3 A 0.3
4 B 10
5 B 11
Edit: I would like to remove data above the 95th percentile for every 'type', i.e. above the 95th percentile of type A, above the 95th percentile of type B, and so on.
CodePudding user response:
Have you tried:
df[df['value'].lt(df.groupby('type')['value'].transform(lambda s: s.quantile(.95)))]
or, in shorter form:
df[df['value'].lt(df.groupby('type')['value'].transform('quantile', .95))]
Output:
Row type value
1 2 A 0.1
2 3 A 0.3
3 4 B 10.0
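For anyone who wants to reproduce this, here is a minimal self-contained sketch, assuming the sample data from the question and pandas' default (linear) quantile interpolation. The names `threshold` and `filtered` are only illustrative. The key point is that `transform` returns a Series aligned with the original rows, so each row is compared with the 95th percentile of its own group:

```python
import pandas as pd

# Sample data copied from the question.
df = pd.DataFrame({
    "Row": [1, 2, 3, 4, 5],
    "type": ["A", "A", "A", "B", "B"],
    "value": [100000, 0.1, 0.3, 10, 11],
})

# One threshold per row: the 95th percentile of that row's own group.
threshold = df.groupby("type")["value"].transform(lambda s: s.quantile(0.95))

# Keep only the rows strictly below their group's threshold.
filtered = df[df["value"] < threshold]
print(filtered)
```

Note that row 5 is dropped as well: with only two values in type B, the interpolated 95th percentile is 10.95, which is below 11.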
CodePudding user response:
A simple solution: use .groupby and .apply to filter rows within each group:
df.groupby('type', group_keys=False).apply(
    lambda g: g[g.value < g.value.quantile(.95)]
)
Row type value
1 2 A 0.1
2 3 A 0.3
3 4 B 10.0
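One edge case that may matter with the strict `<` comparison: a group with a single row has a 95th percentile equal to its only value, so that row is dropped and the group disappears. The sketch below illustrates this with a hypothetical type C row that is not in the original data. Whether `<` or `<=` is the right choice depends on what "above the 95th percentile" should mean for your data:

```python
import pandas as pd

# Question's sample data plus a hypothetical single-row group "C".
df = pd.DataFrame({
    "Row": [1, 2, 3, 4, 5, 6],
    "type": ["A", "A", "A", "B", "B", "C"],
    "value": [100000, 0.1, 0.3, 10, 11, 5.0],
})

# Per-row threshold: the 95th percentile of each row's own group.
threshold = df.groupby("type")["value"].transform(lambda s: s.quantile(0.95))

# Strict comparison: the lone "C" row equals its own threshold, so it is removed.
strict = df[df["value"].lt(threshold)]

# Inclusive comparison: rows equal to the threshold are kept, so "C" survives.
inclusive = df[df["value"].le(threshold)]

print(strict["Row"].tolist())
print(inclusive["Row"].tolist())
```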
Previous solution, if you want to remove data above the 95th percentile only within type A:
vals_a = df.loc[df.type.eq('A'), 'value']
df[(df.value.lt(vals_a.quantile(.95)) & df.type.eq('A')) | df.type.ne('A')]
Result:
Row type value
1 2 A 0.1
2 3 A 0.3
3 4 B 10.0
4 5 B 11.0