I have the following dataframe:
year | var1 | var2 | var3 | … |
---|---|---|---|---|
1970 | 14.52 | 2.88 | 20510 | … |
1970 | 12.36 | 5.5 | 22320 | … |
1970 | 11.85 | 3.12 | 21640 | … |
1970 | 18.30 | 6.3 | 25200 | … |
For each and every column (vari), I would like to keep outliers only (like with df[vari].quantile(0.99)). Every column has a different meaning, so the boundary condition is column dependent.
I found many similar questions, but most of them deal with either a single column to filter or a common outlier boundary across all columns, or they result in a dataframe with values that respect the condition for every column, whereas I need them for any column.
I need to plot every column separately, so I have to keep rows that have an outlier in at least one of the columns.
My idea is to replace non-outlier values in each column with NaN, but I can't find a simple way to filter my data.
Expected output is like:
year | var1 | var2 | var3 | … |
---|---|---|---|---|
1970 | 14.52 | NaN | NaN | … |
1970 | NaN | 5.5 | 24500 | … |
1970 | NaN | NaN | NaN | … |
1970 | 18.30 | 6.3 | 25200 | … |
I tried something using .apply and lambda with no success:
outliers = df[df.apply(lambda x: np.abs(x - x.mean()) / x.std() > 3).any(axis=1)]
This seems to return rows that have an outlier in every column.
I also tried something less elegant by looping on columns but the execution fails (probably due to erroneous logic… code below for the record) :
for col in df.columns:
    filtered_col = df[col].apply(lambda x: np.nan if (x <= df[col].quantile(0.99) and x >= df[col].quantile(0.01)) else x)
    df[col] = filtered_col
Do you have any idea how I can tackle this issue?
CodePudding user response:
Your code works just fine, except that it masks the year column as well. So maybe your problem is just choosing the correct columns:
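var_cols = df.columns[1:]  # skip the year column
for col in var_cols:
    # 1st and 99th percentiles of this column
    mm, MM = df[col].quantile([.01, .99])
    # keep values within [mm, MM]; everything else becomes NaN
    df[col] = df[col].where(df[col].between(mm, MM))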
Output:
year var1 var2 var3
0 1970 14.52 NaN NaN
1 1970 12.36 5.50 22320.0
2 1970 NaN 3.12 21640.0
3 1970 NaN NaN NaN
You can also do it without the for loop:
mm = df[var_cols].quantile(.01)
MM = df[var_cols].quantile(.99)
df[var_cols] = df[var_cols].where(df[var_cols].le(MM) & df[var_cols].ge(mm))
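Note that where(...) as written keeps the values inside the bounds and blanks the extremes. If the goal is the reverse, as in the question (keep only the outliers), one possible sketch is to swap where for mask. The frame below is rebuilt from the question's table, and dropping the rows that have no outlier left is an assumption about what the plotting step needs.

```python
import pandas as pd

# Sample data copied from the question's table
df = pd.DataFrame({
    "year": [1970, 1970, 1970, 1970],
    "var1": [14.52, 12.36, 11.85, 18.30],
    "var2": [2.88, 5.5, 3.12, 6.3],
    "var3": [20510, 22320, 21640, 25200],
})

var_cols = df.columns[1:]  # every column except year

# Per-column bounds: each is a Series indexed by column name
lower = df[var_cols].quantile(0.01)
upper = df[var_cols].quantile(0.99)

# True where a value lies inside its own column's bounds
inside = df[var_cols].ge(lower) & df[var_cols].le(upper)

# mask() blanks the cells where the condition is True, so only outliers remain
outliers = df.copy()
outliers[var_cols] = df[var_cols].mask(inside)

# Assumption: rows with no outlier in any column are not needed for plotting
outliers = outliers.dropna(subset=var_cols, how="all")
print(outliers)
```

Each column can then be plotted on its own by calling dropna() on that single column.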
For mean and std:
mean, std = df[var_cols].mean(), df[var_cols].std()
mm, MM = mean - 3*std, mean + 3*std
df[var_cols] = df[var_cols].where(df[var_cols].le(MM) & df[var_cols].ge(mm))
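The mean/std bounds above also keep the values inside the band. For the question's goal of keeping only the outliers, a minimal sketch could wrap the inverted test in a small function. The name keep_outliers, its n_std parameter, and the demo data are all made up for illustration.

```python
import pandas as pd

def keep_outliers(df, cols, n_std=3):
    """Hypothetical helper: blank every value that lies within
    n_std standard deviations of its own column mean."""
    mean, std = df[cols].mean(), df[cols].std()
    inside = df[cols].ge(mean - n_std * std) & df[cols].le(mean + n_std * std)
    out = df.copy()
    out[cols] = df[cols].mask(inside)  # NaN where the value is not an outlier
    return out

# Made-up data: column 'a' has one extreme value, column 'b' has none
demo = pd.DataFrame({
    "year": [1970] * 20,
    "a": [10.0] * 19 + [100.0],
    "b": [float(i) for i in range(20)],
})
result = keep_outliers(demo, ["a", "b"])
print(result.dropna(subset=["a", "b"], how="all"))
```

With only four rows, as in the question's sample, no value can lie more than 1.5 sample standard deviations from its column mean, so the 3-std rule would blank everything there. The quantile bounds are the more practical choice for small samples.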