Pandas: Adding column indicating outlier or not

Time:04-08

I need to indicate if a value is an outlier by adding a new column with True/False.

I tried the following code, which splits the dataframe into rows with and without outliers, flags each part, and concatenates them again.

It is working fine, but I think there should be a shorter and more efficient way to do it. Any ideas?

def indicate_outlier(df_with_outlier):
    cols = ["Col1"]
    Q1 = df_with_outlier[cols].quantile(0.25)
    Q3 = df_with_outlier[cols].quantile(0.75)
    IQR = Q3 - Q1
    df_no_outliers = df_with_outlier[
        ~(
            (df_with_outlier[cols] < (Q1 - 1.5 * IQR))
            | (df_with_outlier[cols] > (Q3 + 1.5 * IQR))
        ).any(axis=1)
    ]

    df_outliers = df_with_outlier[
        (
            (df_with_outlier[cols] < (Q1 - 1.5 * IQR))
            | (df_with_outlier[cols] > (Q3 + 1.5 * IQR))
        ).any(axis=1)
    ]
    df_no_outliers["EXCLUDED"] = 0
    df_outliers["EXCLUDED"] = 1

    return pd.concat([df_no_outliers, df_outliers])

CodePudding user response:

Why not just create a column "is_outlier" that holds the result of the condition? You could replace:

df_no_outliers = df_with_outlier[
    ~(
        (df_with_outlier[cols] < (Q1 - 1.5 * IQR))
        | (df_with_outlier[cols] > (Q3 + 1.5 * IQR))
    ).any(axis=1)
]

df_outliers = df_with_outlier[
    (
        (df_with_outlier[cols] < (Q1 - 1.5 * IQR))
        | (df_with_outlier[cols] > (Q3 + 1.5 * IQR))
    ).any(axis=1)
]
df_no_outliers["EXCLUDED"] = 0
df_outliers["EXCLUDED"] = 1

return pd.concat([df_no_outliers, df_outliers])

with:

df_with_outlier["is_outlier"] = (
    (df_with_outlier[cols] < (Q1 - 1.5 * IQR))
    | (df_with_outlier[cols] > (Q3 + 1.5 * IQR))
).any(axis=1)

return df_with_outlier
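
Putting this suggestion together into one function, a minimal sketch might look like the following. The `cols` parameter, the use of `assign()` to avoid modifying the caller's dataframe, and the sample data are assumptions added for illustration; they are not part of the original question.

```python
import pandas as pd


def indicate_outlier(df_with_outlier, cols=("Col1",)):
    """Return a copy of the frame with a boolean "is_outlier" column."""
    cols = list(cols)
    Q1 = df_with_outlier[cols].quantile(0.25)
    Q3 = df_with_outlier[cols].quantile(0.75)
    IQR = Q3 - Q1

    # Row-wise flag: True if any selected column falls outside
    # the range [Q1 - 1.5 * IQR, Q3 + 1.5 * IQR].
    is_outlier = (
        (df_with_outlier[cols] < (Q1 - 1.5 * IQR))
        | (df_with_outlier[cols] > (Q3 + 1.5 * IQR))
    ).any(axis=1)

    # assign() returns a new frame, so the input is left untouched.
    return df_with_outlier.assign(is_outlier=is_outlier)


# Made-up sample data: 100 lies far from the other values.
sample = pd.DataFrame({"Col1": [10, 12, 11, 13, 12, 100]})
flagged = indicate_outlier(sample)
print(flagged)
```

If a 0/1 column like the original "EXCLUDED" is preferred over True/False, the flag can be converted with `is_outlier.astype(int)`.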

CodePudding user response:

Here is a simpler approach:

def indicate_outlier(df, col):
  """
  Returns df with new binary column indicating whether
  `col` is an outlier.
  """
  Q1 = df[col].quantile(0.25)
  Q3 = df[col].quantile(0.75)
  IQR = Q3 - Q1

  df[f'Outlier_{col}'] = (df[col] < (Q1 - 1.5 * IQR)) | (df[col] > (Q3 + 1.5 * IQR))

  return df
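
As a rough usage sketch for a function of this shape: it writes the new column into the dataframe it receives, so passing a copy keeps the original unchanged. The column name `price` and the data below are hypothetical.

```python
import pandas as pd


def indicate_outlier(df, col):
    """Add a boolean column flagging IQR-based outliers in `col`."""
    Q1 = df[col].quantile(0.25)
    Q3 = df[col].quantile(0.75)
    IQR = Q3 - Q1
    df[f"Outlier_{col}"] = (df[col] < (Q1 - 1.5 * IQR)) | (df[col] > (Q3 + 1.5 * IQR))
    return df


# Hypothetical data for illustration only.
data = pd.DataFrame({"price": [10, 12, 11, 13, 12, 100]})

# Pass a copy so that `data` itself does not gain the new column.
result = indicate_outlier(data.copy(), "price")

# The boolean column can be summed to count outliers
# or negated to keep only the remaining rows.
n_outliers = int(result["Outlier_price"].sum())
cleaned = result.loc[~result["Outlier_price"]]
print(n_outliers, len(cleaned))
```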