Filter pandas dataframe to keep outliers only for every column

Time:06-15

I have the following dataframe :

year var1 var2 var3
1970 14.52 2.88 20510
1970 12.36 5.5 22320
1970 11.85 3.12 21640
1970 18.30 6.3 25200

For every column (vari), I would like to keep only the outliers (for example, based on df[vari].quantile(0.99)). Each column has a different meaning, so the boundary condition is column-dependent.

I found many similar questions, but most of them either deal with a single column to filter, use a common outlier boundary across all columns, or result in a dataframe with values that respect the condition for every column, whereas I need the condition to hold for any column.

I need to plot every column separately, so I have to keep rows that have an outlier in at least one of the columns.

My idea is to replace non-outlier values in each column by NaN but I can't find a simple way to filter my data.

Expected output is like :

year var1 var2 var3
1970 14.52 NaN NaN
1970 NaN 5.5 24500
1970 NaN NaN NaN
1970 18.30 6.3 25200

I tried something using .apply and lambda with no success :

outliers = df[df.apply(lambda x: np.abs(x - x.mean()) / x.std() > 3).any(axis=1)]

This seems to return rows that have an outlier in every column.

I also tried something less elegant by looping on columns but the execution fails (probably due to erroneous logic… code below for the record) :

for col in df.columns:
    filtered_col = df[col].apply(lambda x: np.nan if (x <= df[col].quantile(0.99) and x >= df[col].quantile(0.01)) else x)
    df[col] = filtered_col

Do you have any idea on how I can tackle this issue ?

CodePudding user response:

Your code works just fine, except that it masks the year column as well. So maybe your problem is just choosing the correct columns:

var_cols = df.columns[1:]

for col in var_cols:
    mm, MM = df[col].quantile([.01,.99])
    df[col] = df[col].where(df[col].between(mm, MM))

Output:

   year   var1  var2     var3
0  1970  14.52   NaN      NaN
1  1970  12.36  5.50  22320.0
2  1970    NaN  3.12  21640.0
3  1970    NaN   NaN      NaN

You can also do it without the for loop:

mm = df[var_cols].quantile(.01)
MM = df[var_cols].quantile(.99)

df[var_cols] = df[var_cols].where(df[var_cols].le(MM) & df[var_cols].ge(mm))
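Note that where keeps the entries for which the condition is True, so the snippets above keep the values inside the quantile band and blank the extremes. If the goal is the opposite, as in the question (keep only the outliers), one option is to swap where for mask. The following is a minimal sketch under that assumption; the helper name keep_quantile_outliers and its lower/upper parameters are made up for illustration, and the data is the sample from the question.

```python
import pandas as pd

# Sample data copied from the question.
df = pd.DataFrame({
    "year": [1970, 1970, 1970, 1970],
    "var1": [14.52, 12.36, 11.85, 18.30],
    "var2": [2.88, 5.5, 3.12, 6.3],
    "var3": [20510, 22320, 21640, 25200],
})


def keep_quantile_outliers(frame, cols, lower=0.01, upper=0.99):
    """Return a copy where values inside the quantile band become NaN."""
    lo = frame[cols].quantile(lower)  # one lower bound per column
    hi = frame[cols].quantile(upper)  # one upper bound per column
    inside = frame[cols].ge(lo) & frame[cols].le(hi)
    out = frame.copy()
    # mask() blanks the entries where the condition is True,
    # so only values outside the band survive.
    out[cols] = frame[cols].mask(inside)
    return out


var_cols = ["var1", "var2", "var3"]
outliers = keep_quantile_outliers(df, var_cols)
# Drop rows in which no variable column holds an outlier.
outliers = outliers.dropna(how="all", subset=var_cols)
print(outliers)
```

The dropna(how="all", subset=var_cols) step is there because the question asks to keep only rows with an outlier in at least one column; leave it out if you want to preserve the original row count.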

For mean and std:

mean, std = df[var_cols].mean(), df[var_cols].std()
mm, MM = mean - 3*std, mean + 3*std
df[var_cols] = df[var_cols].where(df[var_cols].le(MM) & df[var_cols].ge(mm))
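As with the quantile version, this where call keeps the values within three standard deviations and blanks the rest. A sketch of the reverse (keeping only values beyond k standard deviations, then dropping rows with no outlier) could look like the code below. The function name keep_zscore_outliers, the k parameter, and the demo data are hypothetical. One caveat: with the sample standard deviation that pandas uses by default, the largest possible |z| among n rows is (n-1)/sqrt(n), which is 1.5 for n=4, so a threshold of 3 can never flag anything in the four-row sample from the question.

```python
import pandas as pd


def keep_zscore_outliers(frame, cols, k=3.0):
    """Blank values within k standard deviations of their column mean."""
    mean, std = frame[cols].mean(), frame[cols].std()
    z = (frame[cols] - mean) / std  # column-wise z-scores
    out = frame.copy()
    # where() keeps entries whose condition is True, here |z| > k.
    out[cols] = frame[cols].where(z.abs() > k)
    # Keep only rows that still hold at least one outlier.
    return out.dropna(how="all", subset=cols)


# Hypothetical data: twenty zeros and a single extreme value.
demo = pd.DataFrame({"year": [1970] * 21, "x": [0.0] * 20 + [100.0]})
result = keep_zscore_outliers(demo, ["x"])
print(result)
```

Only the row holding the extreme value should remain here, since it is the only one more than three standard deviations from the column mean.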