Remove data by percentile grouping by another column

Time:09-01

I can remove data above the 95th percentile of a column using:

df[df.value < df.value.quantile(.95)]

How can I remove data above the 95th percentile within each group defined by another column?

So if I have a dataframe like the one below, I want to remove row 1 because it is above the 95th percentile within type A.

Row  type  value 
1    A     100000
2    A     0.1
3    A     0.3
4    B     10
5    B     11

Edit: I would like to remove data above the 95th percentile for every type, i.e. above the 95th percentile of type A, of type B, and so on.

CodePudding user response:

Have you tried:

df[df['value'].lt(df.groupby('type')['value'].transform(lambda s: s.quantile(.95)))]

or, in shorter form:

df[df['value'].lt(df.groupby('type')['value'].transform('quantile', .95))]

output:

   Row type  value
1    2    A    0.1
2    3    A    0.3
3    4    B   10.0
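
To see why these rows survive, it can help to look at the per-group thresholds that `transform` broadcasts back onto each row. Below is a minimal, self-contained sketch that rebuilds the sample frame from the question; the variable names `thresholds` and `kept` are only illustrative.

```python
import pandas as pd

# Sample data from the question.
df = pd.DataFrame({
    "Row": [1, 2, 3, 4, 5],
    "type": ["A", "A", "A", "B", "B"],
    "value": [100000, 0.1, 0.3, 10, 11],
})

# 95th percentile of each group, repeated for every row of that group.
thresholds = df.groupby("type")["value"].transform(lambda s: s.quantile(0.95))

# Keep only the rows strictly below their own group's threshold.
kept = df[df["value"] < thresholds]
print(kept)
```

With pandas' default linear interpolation, the threshold works out to about 90000.03 for type A and 10.95 for type B, which is why row 5 is dropped as well. Note that the strict `<` removes every row of a group whose values are all identical (including single-row groups), since each value then equals the threshold; `<=` may be the better choice if such groups should be kept.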

CodePudding user response:

A simple solution: use .groupby and .apply to filter the rows within each group:

df.groupby('type', group_keys=False).apply(
    lambda g: g[g.value < g.value.quantile(.95)]
)

Output:

   Row type  value
1    2    A    0.1
2    3    A    0.3
3    4    B   10.0
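
One caveat, as far as I know: pandas 2.2 and later emit a DeprecationWarning when `groupby(...).apply` operates on the grouping column, as it does here. If that matters, one alternative that does not rely on `apply` is to iterate over the groups and concatenate the filtered pieces. A minimal sketch on the question's sample data (the variable name `filtered` is mine):

```python
import pandas as pd

# Sample data from the question.
df = pd.DataFrame({
    "Row": [1, 2, 3, 4, 5],
    "type": ["A", "A", "A", "B", "B"],
    "value": [100000, 0.1, 0.3, 10, 11],
})

# Iterating over a groupby yields (key, sub-frame) pairs; each sub-frame
# keeps all columns, including the grouping column.
filtered = pd.concat(
    [g[g["value"] < g["value"].quantile(0.95)] for _, g in df.groupby("type")]
)
print(filtered)
```

This gives the same three rows as the `apply` version, with the original index labels preserved.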

Previous solution, if you want to remove data above the 95th percentile only in type A:

vals_a = df.loc[df.type.eq('A'), 'value']

df[(df.value.lt(vals_a.quantile(.95)) & df.type.eq('A')) | df.type.ne('A')]

Result:

   Row type  value
1    2    A    0.1
2    3    A    0.3
3    4    B   10.0
4    5    B   11.0
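
The same idea can be extended to any subset of types. A hedged sketch: compute the per-group thresholds once with `transform`, then exempt the types that should not be trimmed. The helper name `drop_above_quantile` and its parameters are hypothetical, not part of pandas.

```python
import pandas as pd

def drop_above_quantile(df, types, q=0.95):
    """Drop rows at or above their type's q-quantile, but only for the listed types."""
    # Per-type threshold, aligned row by row with df.
    thresholds = df.groupby("type")["value"].transform(lambda s: s.quantile(q))
    below = df["value"] < thresholds
    # Rows whose type is not listed are always kept.
    exempt = ~df["type"].isin(types)
    return df[below | exempt]

# Sample data from the question.
df = pd.DataFrame({
    "Row": [1, 2, 3, 4, 5],
    "type": ["A", "A", "A", "B", "B"],
    "value": [100000, 0.1, 0.3, 10, 11],
})

print(drop_above_quantile(df, ["A"]))
```

Calling it with `["A"]` trims only type A, as in the result above; passing `["A", "B"]` trims both types.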