I'm a little confused about how to deal with outliers. I have a DataFrame, and for every numeric column I need to find the outliers. If a value falls outside the outlier limits, I want to replace it with np.nan.
I think my problem is in the replacement step: for some reason I don't understand how to access the outlier values in order to set them to np.nan.
import numpy as np

def outlier(df):
    new_df = df.copy()
    numeric_cols = new_df._get_numeric_data().columns
    for col in numeric_cols:
        q1 = np.percentile(new_df[col], 25)
        q3 = np.percentile(new_df[col], 75)
        IQR = q3 - q1
        lower_limit = q1 - (1.5*IQR)
        upper_limit = q3 + (1.5*IQR)
        if (new_df[col][0] < lower_limit) | (new_df[col][0] > upper_limit):
            new_df[col] = np.nan
    return new_df
CodePudding user response:
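Your condition only tests the first row of each column (new_df[col][0]), and the assignment then overwrites the whole column. Replace:
    if (new_df[col][0] < lower_limit) | (new_df[col][0] > upper_limit):
        new_df[col] = np.nan
with a boolean mask passed to DataFrame.loc, so that only the outlying rows are set to NaN:
    new_df.loc[(new_df[col] < lower_limit) | (new_df[col] > upper_limit), col] = np.nan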
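Or, with Series.between (inclusive="both" keeps values that lie exactly on a limit, which matches the strict comparisons above):
    new_df.loc[~new_df[col].between(lower_limit, upper_limit, inclusive="both"), col] = np.nan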
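Put together, the loop version might look like the sketch below. The function name outlier_loop, the sample DataFrame and its column names are made up for illustration, and select_dtypes is used here in place of the private _get_numeric_data (my assumption, not something the question requires).

```python
import numpy as np
import pandas as pd

def outlier_loop(df):
    # Hypothetical name; work on a copy so the caller's frame is untouched.
    new_df = df.copy()
    # Assumption: select_dtypes stands in for the private _get_numeric_data.
    numeric_cols = new_df.select_dtypes(include="number").columns
    for col in numeric_cols:
        q1 = np.percentile(new_df[col], 25)
        q3 = np.percentile(new_df[col], 75)
        iqr = q3 - q1
        lower_limit = q1 - 1.5 * iqr
        upper_limit = q3 + 1.5 * iqr
        # Row-wise boolean mask: True where the value is outside the limits.
        is_outlier = (new_df[col] < lower_limit) | (new_df[col] > upper_limit)
        new_df.loc[is_outlier, col] = np.nan
    return new_df

# Made-up sample data; float columns so that NaN fits without a dtype change.
sample = pd.DataFrame({
    "a": [1.0, 2.0, 3.0, 4.0, 100.0],
    "b": [10.0, 11.0, 12.0, 13.0, 14.0],
    "name": ["v", "w", "x", "y", "z"],
})
cleaned = outlier_loop(sample)
print(cleaned)
```

In this sample only the 100.0 in column a lies outside its limits, so it is the only value replaced; the non-numeric column is left alone.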
You can also avoid looping over the numeric columns by processing them all together and setting the NaNs with DataFrame.mask:
def outlier(df):
    new_df = df.copy()
    numeric_cols = new_df._get_numeric_data().columns
    # One quartile per column (axis=0 computes down the rows)
    q1 = np.percentile(new_df[numeric_cols], 25, axis=0)
    q3 = np.percentile(new_df[numeric_cols], 75, axis=0)
    IQR = q3 - q1
    lower_limit = q1 - (1.5*IQR)
    upper_limit = q3 + (1.5*IQR)
    # The per-column limits broadcast across the rows of the DataFrame
    mask = (new_df[numeric_cols] < lower_limit) | (new_df[numeric_cols] > upper_limit)
    new_df[numeric_cols] = new_df[numeric_cols].mask(mask)
    return new_df
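One caveat: np.percentile returns NaN for a column that already contains NaN, so no limits can be computed for that column and nothing in it gets masked. A possible variant, assuming you want existing missing values to be ignored when the quartiles are computed, uses DataFrame.quantile, which skips NaN. The function name outlier_quantile and the sample data are hypothetical.

```python
import numpy as np
import pandas as pd

def outlier_quantile(df):
    # Hypothetical variant of the function above, not the original answer's code.
    new_df = df.copy()
    numeric_cols = new_df.select_dtypes(include="number").columns
    numeric = new_df[numeric_cols]
    # quantile() ignores NaN and returns a Series indexed by column name.
    q1 = numeric.quantile(0.25)
    q3 = numeric.quantile(0.75)
    iqr = q3 - q1
    lower_limit = q1 - 1.5 * iqr
    upper_limit = q3 + 1.5 * iqr
    # The Series limits align with the DataFrame columns by label.
    mask = (numeric < lower_limit) | (numeric > upper_limit)
    new_df[numeric_cols] = numeric.mask(mask)
    return new_df

# Made-up sample: column b already has a missing value.
sample = pd.DataFrame({
    "a": [1.0, 2.0, 3.0, 4.0, 100.0],
    "b": [10.0, np.nan, 12.0, 13.0, 500.0],
})
result = outlier_quantile(sample)
print(result)
```

Here the existing NaN in column b stays as it is, and the extreme values in both columns are still detected because the quartiles are computed from the non-missing values only.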