I'm trying to remove outliers using the IQR method. However, the shape of my df remains the same.
Here is the code:
def IQR_outliers(df):
    Q1 = df.quantile(0.25)
    Q3 = df.quantile(0.75)
    IQR = Q3 - Q1
    df = df[~((df < (Q1 - 1.5 * IQR)) | (df > (Q3 + 1.5 * IQR)))]
    return df
IQR_outliers(df['Distance'])
IQR_outliers(df['Price'])
CodePudding user response:
Your function operates on whatever object is passed to it, but you're only passing a single Series each time you call it, so the dataframe itself is never filtered. You're also not capturing the output. These issues stack on top of each other, which makes the problem harder to untangle.
So here's what I would do:
- add a column argument to your function
- modify the function to only consider that column when selecting rows from the entire dataframe
- pipe the dataframe to that function a couple of times
So that's:
def IQR_outliers(df, column):
    Q1 = df[column].quantile(0.25)
    Q3 = df[column].quantile(0.75)
    IQR = Q3 - Q1
    df = df.loc[lambda df: ~((df[column] < (Q1 - 1.5 * IQR)) | (df[column] > (Q3 + 1.5 * IQR)))]
    return df
revised_df = df.pipe(IQR_outliers, 'Distance').pipe(IQR_outliers, 'Price')
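To check that this actually changes the shape, here is a minimal sketch on made-up data. The values below are hypothetical and not from the question; only the column names are taken from it.

```python
import pandas as pd

def IQR_outliers(df, column):
    # Fences are computed from one column, but rows are selected
    # from the whole dataframe.
    Q1 = df[column].quantile(0.25)
    Q3 = df[column].quantile(0.75)
    IQR = Q3 - Q1
    return df.loc[lambda d: ~((d[column] < (Q1 - 1.5 * IQR)) | (d[column] > (Q3 + 1.5 * IQR)))]

# Hypothetical sample data: the last Distance value is an obvious outlier.
df = pd.DataFrame({
    "Distance": [1, 2, 3, 4, 100],
    "Price": [10, 11, 12, 13, 14],
})

# Calling the function without assigning the result leaves df untouched.
IQR_outliers(df, "Distance")

# Capturing the result is what gives you the filtered dataframe.
revised_df = df.pipe(IQR_outliers, "Distance").pipe(IQR_outliers, "Price")
print(df.shape, revised_df.shape)
```

The original df keeps all of its rows, which is why the shape looked unchanged in the question; only revised_df has the outlier row removed.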
Note that the way you've demonstrated this, you'll very likely drop entire rows where Distance is an outlier even if Price in that row is not. If you don't want to do that, you'll need to stack your dataframe, apply this function to a groupby operation, and then optionally unstack the dataframe.
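As a rough sketch of that stack / groupby / unstack idea, something like the following could work. The function name mask_outliers_per_column and the sample data are assumptions for illustration, and this is only one possible reading of the approach: outlying cells become NaN instead of the whole row being dropped.

```python
import pandas as pd

def mask_outliers_per_column(df, columns):
    # Hypothetical helper, not part of pandas.
    # Long format: one value per (row label, column name) pair.
    long = df[columns].stack()

    # Compute the IQR fences separately for each original column.
    grouped = long.groupby(level=1)
    q1 = grouped.transform(lambda s: s.quantile(0.25))
    q3 = grouped.transform(lambda s: s.quantile(0.75))
    iqr = q3 - q1

    # Keep only the values inside their own column's fences.
    keep = (long >= q1 - 1.5 * iqr) & (long <= q3 + 1.5 * iqr)

    # Back to wide format; removed cells show up as NaN.
    return long[keep].unstack()

# Hypothetical sample data, not from the question.
df = pd.DataFrame({
    "Distance": [1, 2, 3, 4, 100],
    "Price": [10, 11, 12, 13, 14],
})

masked = mask_outliers_per_column(df, ["Distance", "Price"])
print(masked)
```

With this version the row containing the outlying Distance is kept, with its Distance set to NaN and its Price preserved. A row would only disappear if every one of its selected values were an outlier.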