Dealing with outliers in Pandas - Substitution of values

I'm a little confused about how to deal with outliers. I have a DataFrame, and for every column that holds numeric values I need to find the outliers. If a value falls outside the outlier limits, I want to replace it with np.nan.

I think my problem is in replacing the outlier values with np.nan: for some reason I don't understand how to access them.

def outlier(df):
    new_df = df.copy()
    numeric_cols = new_df._get_numeric_data().columns
    for col in numeric_cols:
        q1 = np.percentile(new_df[col],25)
        q3 = np.percentile(new_df[col],75)
        IQR = q3 - q1
        lower_limit = q1 - (1.5*IQR)
        upper_limit = q3 + (1.5*IQR)
        if (new_df[col][0] < lower_limit) | (new_df[col][0] > upper_limit):
            new_df[col] = np.nan
            
    return new_df

CodePudding user response:

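The if statement tests only the first row (new_df[col][0]) and then overwrites the entire column. Replace:

        if (new_df[col][0] < lower_limit) | (new_df[col][0] > upper_limit):
            new_df[col] = np.nan

with a boolean mask passed to DataFrame.loc, so that only the outlying rows are set: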

new_df.loc[(new_df[col] < lower_limit) | (new_df[col] > upper_limit), col] = np.nan

Or:

new_df.loc[~new_df[col].between(lower_limit, upper_limit, inclusive="both"), col] = np.nan
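
As a quick check, the corrected loop can be run end to end on a small frame. This is only a sketch: the column names and sample values are made up for illustration, and it uses select_dtypes as a public stand-in for the private _get_numeric_data helper. The sample columns are floats so that NaN can be stored without changing the dtype.

```python
import numpy as np
import pandas as pd


def outlier(df):
    """Replace values outside the 1.5*IQR limits with NaN, column by column."""
    new_df = df.copy()
    # Public alternative to the private _get_numeric_data() helper
    numeric_cols = new_df.select_dtypes(include="number").columns
    for col in numeric_cols:
        q1 = np.percentile(new_df[col], 25)
        q3 = np.percentile(new_df[col], 75)
        iqr = q3 - q1
        lower_limit = q1 - 1.5 * iqr
        upper_limit = q3 + 1.5 * iqr
        # Row-wise boolean mask: only the offending cells are overwritten
        is_outlier = (new_df[col] < lower_limit) | (new_df[col] > upper_limit)
        new_df.loc[is_outlier, col] = np.nan
    return new_df


# Hypothetical sample data, for illustration only
df = pd.DataFrame({
    "a": [1.0, 2.0, 3.0, 4.0, 100.0],
    "b": [10.0, 11.0, 12.0, 13.0, 14.0],
    "label": ["v", "w", "x", "y", "z"],
})
result = outlier(df)
print(result)
```

Only the 100.0 in column a lies outside its limits here, so it is the only cell that becomes NaN; the non-numeric label column is left alone, and the original df is untouched because the function works on a copy.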

You can also avoid looping over the numeric columns by processing them all together and setting the NaNs with DataFrame.mask:

def outlier(df):
    new_df = df.copy()
    numeric_cols = new_df._get_numeric_data().columns
    
    q1 = np.percentile(new_df[numeric_cols],25, axis=0)
    q3 = np.percentile(new_df[numeric_cols],75, axis=0)
    IQR = q3 - q1
    lower_limit = q1 - (1.5*IQR)
    upper_limit = q3 + (1.5*IQR)
    mask = (new_df[numeric_cols] < lower_limit) | (new_df[numeric_cols] > upper_limit)
    new_df[numeric_cols] = new_df[numeric_cols].mask(mask)
    return new_df
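
One caveat with np.percentile is that it returns NaN for any column that already contains NaN, in which case the comparisons are all False and nothing in that column is flagged. A possible variant, sketched below under that assumption, computes the quartiles with DataFrame.quantile, which skips missing values. The function name outlier_nan_safe, the k parameter and the sample data are hypothetical and not part of the original answer.

```python
import numpy as np
import pandas as pd


def outlier_nan_safe(df, k=1.5):
    """Mask values outside the k*IQR limits; quantiles skip existing NaNs."""
    new_df = df.copy()
    numeric_cols = new_df.select_dtypes(include="number").columns
    numeric = new_df[numeric_cols]

    # DataFrame.quantile ignores NaN, whereas np.percentile would return NaN
    q1 = numeric.quantile(0.25)
    q3 = numeric.quantile(0.75)
    iqr = q3 - q1
    lower_limit = q1 - k * iqr
    upper_limit = q3 + k * iqr

    # The limits are Series indexed by column name, so they align by label
    mask = numeric.lt(lower_limit, axis="columns") | numeric.gt(upper_limit, axis="columns")
    new_df[numeric_cols] = numeric.mask(mask)
    return new_df


# Hypothetical sample data with a pre-existing missing value in column a
df = pd.DataFrame({
    "a": [1.0, 2.0, 3.0, 4.0, 100.0, np.nan],
    "b": [10.0, 11.0, 12.0, 13.0, 14.0, 15.0],
})
cleaned = outlier_nan_safe(df)
print(cleaned)
```

In this sample the 100.0 in column a is still masked even though the column already holds a NaN, and the existing NaN is simply kept.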