Outlier detection not working in pandas, negative number on the lower limit

Time:08-16

I have the following dataset:

https://raw.githubusercontent.com/Joffreybvn/real-estate-data-analysis/master/data/clean/belgium_real_estate.csv

With describe() I get the min and max price:

type_of_property    price   number_of_rooms house_area  fully_equipped_kitchen  open_fire   terrace garden  surface_of_the_land number_of_facades   ... city_Zwevegem   city_Zwijnaarde as new  good    just renovated  to be done up   to renovate to restore  unknown pricepersqm
count   40395.000000    40395.000000    40395.000000    40395.000000    40395.000000    40395.000000    40395.000000    40395.000000    40395.000000    40395.000000    ... 40395.000000    40395.000000    40395.000000    40395.000000    40395.000000    40395.000000    40395.000000    40395.000000    40395.000000    40395.000000
mean    0.469241    314114.661617   2.813838    152.466320  0.697512    0.053596    0.620176    0.321228    545.840079  2.071494    ... 0.001015    0.000347    0.299443    0.271940    0.053150    0.069043    0.060428    0.003491    0.242505    2345.424291
std 0.499059    168151.672366   1.260968    95.649206   0.459341    0.225221    0.485349    0.466954    3609.242736 1.416501    ... 0.031843    0.018614    0.458020    0.444964    0.224336    0.253531    0.238282    0.058978    0.428604    1286.481438
min 0.000000    2500.000000 1.000000    5.000000    0.000000    0.000000    0.000000    0.000000    0.000000    0.000000    ... 0.000000    0.000000    0.000000    0.000000    0.000000    0.000000    0.000000    0.000000    0.000000    4.166667
25% 0.000000    199000.000000   2.000000    92.000000   0.000000    0.000000    0.000000    0.000000    0.000000    0.000000    ... 0.000000    0.000000    0.000000    0.000000    0.000000    0.000000    0.000000    0.000000    0.000000    1569.563490
50% 0.000000    275000.000000   3.000000    130.000000  1.000000    0.000000    1.000000    0.000000    0.000000    2.000000    ... 0.000000    0.000000    0.000000    0.000000    0.000000    0.000000    0.000000    0.000000    0.000000    2180.616740
75% 1.000000    379000.000000   3.000000    184.000000  1.000000    0.000000    1.000000    1.000000    416.000000  3.000000    ... 0.000000    0.000000    1.000000    1.000000    0.000000    0.000000    0.000000    0.000000    0.000000    2837.837838
max 1.000000    950000.000000   18.000000   3560.000000 1.000000    1.000000    1.000000    1.000000    400000.000000   4.000000    ... 1.000000    1.000000    1.000000    1.000000    1.000000    1.000000    1.000000    1.000000    1.000000    67000.000000
8 rows × 1065 columns

Minimum price: 2,500. Maximum price: 950,000.

When I try to use outlier detection with this function:

def outliers(df, feature):
    # Quartiles of the chosen column
    Q1 = df[feature].quantile(0.25)
    Q3 = df[feature].quantile(0.75)
    IQR = Q3 - Q1
    # Limits sit 1.5 * IQR beyond each quartile
    upper_limit = Q3 + 1.5 * IQR
    lower_limit = Q1 - 1.5 * IQR
    return upper_limit, lower_limit

upper, lower = outliers(data, "price")
print("Upper whisker: ", upper)
print("Lower Whisker: ", lower)

I get a negative number for the lower limit:

Upper whisker: 649000.0
Lower Whisker: -71000.0

What am I doing wrong here?

CodePudding user response:

Your method is correct, but one small addition is needed to make it complete.

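When the lower limit (Q1 - 1.5 * IQR) is smaller than the minimum, as it is here (-71,000 < 2,500), you don't have any outliers on the low side. The negative value is only a threshold, not a price that has to exist in the data.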

On the other hand, the upper limit (Q3 + 1.5 * IQR) is smaller than the maximum (649,000 < 950,000), so there are some outliers on the high side.
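
To make that comparison concrete, here is a minimal sketch. The prices are made up for illustration (they are not the real dataset, only chosen so that the quartiles match the describe() output above), and iqr_limits is a hypothetical helper name, not something from pandas.

```python
import pandas as pd

def iqr_limits(series):
    """Return (lower_limit, upper_limit) using the 1.5 * IQR rule."""
    q1 = series.quantile(0.25)
    q3 = series.quantile(0.75)
    iqr = q3 - q1
    return q1 - 1.5 * iqr, q3 + 1.5 * iqr

# Made-up prices, NOT the real data: Q1 = 199000 and Q3 = 379000 by construction
prices = pd.Series([2500, 150000, 199000, 250000, 275000,
                    300000, 379000, 420000, 950000])

lower, upper = iqr_limits(prices)

# A limit below the minimum simply means nothing falls under it
low_outliers = prices[prices < lower]
high_outliers = prices[prices > upper]

print("Lower limit:", lower, "-> low outliers:", len(low_outliers))
print("Upper limit:", upper, "-> high outliers:", len(high_outliers))
```

With this toy series the lower limit is negative and no value falls below it, while the single value above the upper limit is flagged as a high outlier.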

From Wikipedia:

The upper whisker boundary of the box-plot is the largest data value that is within 1.5 IQR above the third quartile.

Similarly, the lower whisker boundary of the box plot is the smallest data value that is within 1.5 IQR below the first quartile.

So in your case the upper whisker is the largest data value that is below 649,000.

And your lower whisker is your minimum: 2,500.
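
If you want the whiskers themselves rather than the limits, one way is to keep only the values inside the limits and take their min and max. This is a sketch under assumptions: whiskers is a hypothetical helper name, and the prices are made-up illustration values, not the real dataset.

```python
import pandas as pd

def whiskers(series):
    """Return (lower_whisker, upper_whisker) as actual data values."""
    q1 = series.quantile(0.25)
    q3 = series.quantile(0.75)
    iqr = q3 - q1
    lower_limit = q1 - 1.5 * iqr
    upper_limit = q3 + 1.5 * iqr
    # Keep only the observations that lie within the two limits
    inside = series[(series >= lower_limit) & (series <= upper_limit)]
    # Whiskers are the most extreme values that are still inside
    return inside.min(), inside.max()

# Made-up prices for illustration only
prices = pd.Series([2500, 150000, 199000, 250000, 275000,
                    300000, 379000, 420000, 950000])

lower_whisker, upper_whisker = whiskers(prices)
print("Lower whisker:", lower_whisker)
print("Upper whisker:", upper_whisker)
```

On the real data you would call it as whiskers(data["price"]). The lower whisker can never be negative unless the column itself contains negative values.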
