I am trying to filter my dataframe based on IQR for a few selected features. The code I use is the following:
import pandas as pd
import numpy as np
# Load data
df = pd.read_csv("dataframe.csv")
features = df.loc[:, ['col1', 'col2', 'col3', 'col4', 'col5']]
print("Old Shape: ", df.shape)
def filtering(column_name):
    print(column_name)
    Q1 = np.percentile(df[column_name], 25, interpolation='midpoint')
    Q3 = np.percentile(df[column_name], 75, interpolation='midpoint')
    IQR = Q3 - Q1
    # Upper bound
    upper = np.where(df[column_name] >= (Q3 + 1.5 * IQR))
    # Lower bound
    lower = np.where(df[column_name] <= (Q1 - 1.5 * IQR))
    # Removing the outliers
    df.drop(upper[0], inplace=True)
    df.drop(lower[0], inplace=True)
    print("New Shape: ", df.shape)
    print('==== done ====')
for col in features.columns:
    filtering(col)
The error, which is raised by df.drop(lower[0], inplace=True), is:
KeyError: '[14] not found in axis'
The KeyError is caused by an index that was already dropped as an outlier in one of the features being detected again: since it no longer exists, it cannot be found. However, I do not understand how a row can be detected as an outlier after it has already been dropped, so I am unsure how to tackle this problem.
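A minimal made-up example (hypothetical toy data, not my real dataframe) that reproduces the same error:

import pandas as pd
import numpy as np

toy = pd.DataFrame({
    'a': [1, 2, 3, 100, 4, 5],       # outlier in the row labelled 3
    'b': [10, 20, 30, 40, 600, 50],  # outlier in the row labelled 4
})

upper = np.where(toy['a'] >= 50)   # (array([3]),) -- positional indices
toy.drop(upper[0], inplace=True)   # label 3 still exists, so this works

upper = np.where(toy['b'] >= 100)  # (array([3]),) -- position of the row labelled 4
toy.drop(upper[0], inplace=True)   # KeyError: '[3] not found in axis'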
CodePudding user response:
Without the dataframe and the exact line in which the error occurs, it's not entirely clear what is happening.
But if you just want your script to keep running, you could wrap the code in a try/except block, like so:
try:
    # Your code
except KeyError:
    # Do whatever you want in case a KeyError occurs, e.g. log or print something
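Applied to the filtering function from the question, that could look roughly like this (a sketch; it only keeps the script running, it does not change which rows get dropped):

def filtering(column_name):
    Q1 = np.percentile(df[column_name], 25, interpolation='midpoint')
    Q3 = np.percentile(df[column_name], 75, interpolation='midpoint')
    IQR = Q3 - Q1
    upper = np.where(df[column_name] >= (Q3 + 1.5 * IQR))
    lower = np.where(df[column_name] <= (Q1 - 1.5 * IQR))
    try:
        df.drop(upper[0], inplace=True)
        df.drop(lower[0], inplace=True)
    except KeyError as err:
        # A label was already removed in an earlier iteration; report it and move on
        print(f"{column_name}: {err}")

Note that this only suppresses the error; once the dataframe has shrunk, the positions returned by np.where may no longer match the remaining index labels.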
CodePudding user response:
After rethinking my approach, I have come up with a solution, which I'll post here:
indices = []

def outlier_indices(column_name):
    Q1 = np.percentile(df[column_name], 25, interpolation='midpoint')
    Q3 = np.percentile(df[column_name], 75, interpolation='midpoint')
    IQR = Q3 - Q1
    # Upper bound
    upper = np.where(df[column_name] >= (Q3 + 1.5 * IQR))[0].tolist()
    # Lower bound
    lower = np.where(df[column_name] <= (Q1 - 1.5 * IQR))[0].tolist()
    indices.extend(upper)
    indices.extend(lower)

for col in features.columns:
    outlier_indices(col)

indices = set(indices)
df.drop(indices, inplace=True)
I tackled the issue of duplicate indices by collecting all outlier indices in one list, converting it to a set to remove the duplicates, and then dropping the outliers in a single call.
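A small self-contained check of the same idea on made-up data (the column names and values here are purely illustrative):

import pandas as pd
import numpy as np

# Row 2 is an outlier in both columns, so its position is collected twice
toy = pd.DataFrame({
    'col1': [1, 2, 500, 3, 4, 5, 6, 7, 8, 9],
    'col2': [10, 20, 900, 30, 40, 50, 60, 70, 80, 90],
})

collected = []
for col in toy.columns:
    Q1 = np.percentile(toy[col], 25, interpolation='midpoint')
    Q3 = np.percentile(toy[col], 75, interpolation='midpoint')
    IQR = Q3 - Q1
    collected.extend(np.where(toy[col] >= Q3 + 1.5 * IQR)[0].tolist())
    collected.extend(np.where(toy[col] <= Q1 - 1.5 * IQR)[0].tolist())

print(collected)                      # [2, 2] -- the same row flagged by both columns
toy.drop(set(collected), inplace=True)
print(toy.shape)                      # (9, 2) -- dropped exactly once, no KeyError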