Remove outlier using quantile python

Time:02-14

I need to remove outliers from a regression dataset. Let's say the dataset is structured as follows:

# dataset named df
humidity     windspeed
 0.01          4.9
 4.5           20.0
 3.5           5.0
 50.0          4.0
 4.2           0.05
 3.4           3.9
 18.0          4.7

# code for outlier removal
def quantile(columns):
   for column in columns:
      lower_quantile = df[column].quantile(0.25)
      upper_quantile = df[column].quantile(0.75)
      df[column] = df[(df[column] >= lower_quantile) & (df[column] <= upper_quantile)]

columns = ['humidity', 'windspeed']
quantile(columns)

On closer inspection, the humidity column has three outliers (50.0, 18.0 and 0.01), while the windspeed column has two (20.0 and 0.05), and the outliers of the two columns are not in the same rows. If I try to remove the outliers with the code above, I get the following error:

ValueError: Columns must be same length as key

From what I understand, once the outliers are removed the columns no longer have the same number of rows, hence the error. Is there another way to overcome this issue? Thanks in advance.
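
One way to check that understanding is to look at what the right-hand side of the assignment actually is. The sketch below rebuilds the sample data and repeats the assignment for one column; the names `rows_kept` and `raised` are illustrative and not part of the original code. It suggests the error comes from assigning a whole DataFrame (both columns) into a single column, rather than from differing row counts, since pandas would align a shorter Series on the index and fill the gaps with NaN.

```python
import pandas as pd

# Sample data copied from the question.
df = pd.DataFrame({
    "humidity":  [0.01, 4.5, 3.5, 50.0, 4.2, 3.4, 18.0],
    "windspeed": [4.9, 20.0, 5.0, 4.0, 0.05, 3.9, 4.7],
})

column = "humidity"
low = df[column].quantile(0.25)
high = df[column].quantile(0.75)

# Boolean indexing on df returns a DataFrame with ALL columns, not one column.
rows_kept = df[(df[column] >= low) & (df[column] <= high)]
print(type(rows_kept).__name__, rows_kept.shape)

raised = False
try:
    # Same kind of assignment as in the question: a two-column
    # DataFrame is pushed into a single column.
    df[column] = rows_kept
except ValueError as exc:
    raised = True
    print("ValueError:", exc)
```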

CodePudding user response:

You can filter on both columns at the same time:

df[
    df['humidity'].between(df['humidity'].quantile(.25), df['humidity'].quantile(.75))
    & df['windspeed'].between(df['windspeed'].quantile(.25), df['windspeed'].quantile(.75))
]

In this case all three objects (df, the condition for 'humidity' and the condition for 'windspeed') have the same length and index, because both conditions are derived from the same df.
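
The same idea can be written for any list of columns by combining one boolean mask per column. The following is a minimal sketch, not a pandas API: `filter_by_quantile` and its `k` parameter are hypothetical names introduced here. With `k=0` it behaves like the filter above; `k=1.5` widens the bounds by 1.5 times the interquartile range, which is a common convention for outlier fences and is an addition here, not something the answer above proposes.

```python
import pandas as pd

# Sample data copied from the question.
df = pd.DataFrame({
    "humidity":  [0.01, 4.5, 3.5, 50.0, 4.2, 3.4, 18.0],
    "windspeed": [4.9, 20.0, 5.0, 4.0, 0.05, 3.9, 4.7],
})

def filter_by_quantile(frame, columns, lower=0.25, upper=0.75, k=0.0):
    """Keep rows whose value in every listed column lies inside
    [Q_lower - k * spread, Q_upper + k * spread], where spread is
    Q_upper - Q_lower. Hypothetical helper for illustration only."""
    # Start with "keep every row" and narrow the mask column by column.
    mask = pd.Series(True, index=frame.index)
    for column in columns:
        q_low = frame[column].quantile(lower)
        q_high = frame[column].quantile(upper)
        spread = q_high - q_low
        mask &= frame[column].between(q_low - k * spread, q_high + k * spread)
    # The original frame is indexed once, so all columns stay aligned.
    return frame[mask]

columns = ["humidity", "windspeed"]
strict = filter_by_quantile(df, columns)          # k=0: plain quantile bounds
fenced = filter_by_quantile(df, columns, k=1.5)   # bounds widened by 1.5 * IQR
print(len(strict), list(fenced.index))
```

On the seven sample rows, the plain quantile bounds (`k=0`) leave no rows at all, because every row has at least one value outside its column's 25%-75% range. With `k=1.5`, rows 0, 2, 5 and 6 remain.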
