I need to flag whether a value is an outlier by adding a new True/False column.
I tried the following code, which splits the dataframe into rows with and without outliers, labels each part, and concatenates them again.
It works fine, but I think there should be a shorter and more efficient way to do it. Any ideas?
import pandas as pd

def indicate_outlier(df_with_outlier):
    cols = ["Col1"]
    # Interquartile range of the selected columns
    Q1 = df_with_outlier[cols].quantile(0.25)
    Q3 = df_with_outlier[cols].quantile(0.75)
    IQR = Q3 - Q1
    # Rows with no value outside [Q1 - 1.5 * IQR, Q3 + 1.5 * IQR]
    df_no_outliers = df_with_outlier[
        ~(
            (df_with_outlier[cols] < (Q1 - 1.5 * IQR))
            | (df_with_outlier[cols] > (Q3 + 1.5 * IQR))
        ).any(axis=1)
    ]
    # Rows with at least one value outside that range
    df_outliers = df_with_outlier[
        (
            (df_with_outlier[cols] < (Q1 - 1.5 * IQR))
            | (df_with_outlier[cols] > (Q3 + 1.5 * IQR))
        ).any(axis=1)
    ]
    df_no_outliers["EXCLUDED"] = 0
    df_outliers["EXCLUDED"] = 1
    return pd.concat([df_no_outliers, df_outliers])
CodePudding user response:
Why not just create a variable "is_outlier" that holds the condition? You could replace:
    df_no_outliers = df_with_outlier[
        ~(
            (df_with_outlier[cols] < (Q1 - 1.5 * IQR))
            | (df_with_outlier[cols] > (Q3 + 1.5 * IQR))
        ).any(axis=1)
    ]
    df_outliers = df_with_outlier[
        (
            (df_with_outlier[cols] < (Q1 - 1.5 * IQR))
            | (df_with_outlier[cols] > (Q3 + 1.5 * IQR))
        ).any(axis=1)
    ]
    df_no_outliers["EXCLUDED"] = 0
    df_outliers["EXCLUDED"] = 1
    return pd.concat([df_no_outliers, df_outliers])
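With:
    is_outlier = (df_with_outlier[cols] < (Q1 - 1.5 * IQR)) | (df_with_outlier[cols] > (Q3 + 1.5 * IQR))
    # cols is a list, so reduce the boolean frame to one value per row
    df_with_outlier["is_outlier"] = is_outlier.any(axis=1)
    return df_with_outlier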
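Put together, the idea could look roughly like the sketch below. The helper name `flag_outliers`, the `k` parameter, and the sample data are made up for illustration and are not part of the question. The function works on a copy, so the caller's dataframe is left untouched.

```python
import pandas as pd

def flag_outliers(df, cols, k=1.5):
    # Hypothetical helper: flags rows where any column in `cols`
    # falls outside [Q1 - k * IQR, Q3 + k * IQR].
    q1 = df[cols].quantile(0.25)
    q3 = df[cols].quantile(0.75)
    iqr = q3 - q1
    # Boolean frame with one column per entry in `cols`
    mask = (df[cols] < (q1 - k * iqr)) | (df[cols] > (q3 + k * iqr))
    out = df.copy()  # avoid modifying the caller's dataframe
    out["is_outlier"] = mask.any(axis=1)
    return out

# Illustrative data only
df = pd.DataFrame({"Col1": [10, 11, 12, 13, 14, 15, 100]})
result = flag_outliers(df, ["Col1"])
print(result)
```

With this data only the row containing 100 should be flagged, since it lies far above Q3 + 1.5 * IQR.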
CodePudding user response:
Here is a simpler version:
def indicate_outlier(df, col):
    """
    Returns df with new binary column indicating whether
    `col` is an outlier.
    """
    Q1 = df[col].quantile(0.25)
    Q3 = df[col].quantile(0.75)
    IQR = Q3 - Q1
    # True where the value lies outside [Q1 - 1.5 * IQR, Q3 + 1.5 * IQR]
    df[f'Outlier_{col}'] = (df[col] < (Q1 - 1.5 * IQR)) | (df[col] > (Q3 + 1.5 * IQR))
    return df
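For example, a quick way to try this out might look like the following. The sample values are invented for illustration, and the function is repeated so the snippet runs on its own. Note that it adds the column to the dataframe passed in, so the original object is modified.

```python
import pandas as pd

def indicate_outlier(df, col):
    # Same 1.5 * IQR rule as above, applied to a single column
    q1 = df[col].quantile(0.25)
    q3 = df[col].quantile(0.75)
    iqr = q3 - q1
    df[f"Outlier_{col}"] = (df[col] < (q1 - 1.5 * iqr)) | (df[col] > (q3 + 1.5 * iqr))
    return df

# Illustrative data only
df = pd.DataFrame({"Col1": [1, 2, 3, 4, 5, 50]})
df = indicate_outlier(df, "Col1")

# The boolean flag can then be used to drop the outliers if needed
df_no_outliers = df[~df["Outlier_Col1"]]
print(df)
print(df_no_outliers)
```

Because the flag is a plain boolean column, filtering with `~df["Outlier_Col1"]` gives the dataframe without outliers that the question originally built by hand.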