I have the following dataset. With describe() I get the minimum and maximum price:
type_of_property price number_of_rooms house_area fully_equipped_kitchen open_fire terrace garden surface_of_the_land number_of_facades ... city_Zwevegem city_Zwijnaarde as new good just renovated to be done up to renovate to restore unknown pricepersqm
count 40395.000000 40395.000000 40395.000000 40395.000000 40395.000000 40395.000000 40395.000000 40395.000000 40395.000000 40395.000000 ... 40395.000000 40395.000000 40395.000000 40395.000000 40395.000000 40395.000000 40395.000000 40395.000000 40395.000000 40395.000000
mean 0.469241 314114.661617 2.813838 152.466320 0.697512 0.053596 0.620176 0.321228 545.840079 2.071494 ... 0.001015 0.000347 0.299443 0.271940 0.053150 0.069043 0.060428 0.003491 0.242505 2345.424291
std 0.499059 168151.672366 1.260968 95.649206 0.459341 0.225221 0.485349 0.466954 3609.242736 1.416501 ... 0.031843 0.018614 0.458020 0.444964 0.224336 0.253531 0.238282 0.058978 0.428604 1286.481438
min 0.000000 2500.000000 1.000000 5.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 ... 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 4.166667
25% 0.000000 199000.000000 2.000000 92.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 ... 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 1569.563490
50% 0.000000 275000.000000 3.000000 130.000000 1.000000 0.000000 1.000000 0.000000 0.000000 2.000000 ... 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 2180.616740
75% 1.000000 379000.000000 3.000000 184.000000 1.000000 0.000000 1.000000 1.000000 416.000000 3.000000 ... 0.000000 0.000000 1.000000 1.000000 0.000000 0.000000 0.000000 0.000000 0.000000 2837.837838
max 1.000000 950000.000000 18.000000 3560.000000 1.000000 1.000000 1.000000 1.000000 400000.000000 4.000000 ... 1.000000 1.000000 1.000000 1.000000 1.000000 1.000000 1.000000 1.000000 1.000000 67000.000000
8 rows × 1065 columns
Minimum: 2,500; maximum: 950,000
When I try to use outlier detection with this function:
def outliers(df, feature):
    # First and third quartiles of the chosen column
    Q1 = df[feature].quantile(0.25)
    Q3 = df[feature].quantile(0.75)
    IQR = Q3 - Q1
    # The limits sit 1.5 * IQR beyond the quartiles
    upper_limit = Q3 + 1.5 * IQR
    lower_limit = Q1 - 1.5 * IQR
    return upper_limit, lower_limit

upper, lower = outliers(data, "price")
print("Upper whisker: ", upper)
print("Lower Whisker: ", lower)
I get a negative number for the lower limit:
Upper whisker: 649000.0
Lower Whisker: -71000.0
What am I doing wrong here?
CodePudding user response:
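Your method is correct, but one small change is needed to make it complete.
When the lower limit (Q1 - 1.5 * IQR) is smaller than the minimum, you don't have any outliers that are very low. Here the lower limit is -71,000 and the minimum price is 2,500, so a negative limit is not an error; it only means no price is low enough to count as an outlier.
On the other hand, the upper limit (Q3 + 1.5 * IQR) is smaller than the maximum (649,000 versus 950,000), so there are some outliers that are very high.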
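As a rough illustration of the comparison above, here is a minimal sketch that counts the outliers on each side. The `iqr_limits` helper and the `prices` values are made up for the example (they are not the dataset from the question), chosen only so that the quartiles match the ones in the describe() output.

```python
import pandas as pd

def iqr_limits(series):
    """Return (lower_limit, upper_limit) using the 1.5 * IQR rule."""
    q1 = series.quantile(0.25)
    q3 = series.quantile(0.75)
    iqr = q3 - q1
    return q1 - 1.5 * iqr, q3 + 1.5 * iqr

# Hypothetical prices, NOT the real dataset from the question
prices = pd.Series(
    [2500, 150000, 199000, 250000, 275000, 300000, 379000, 400000, 950000]
)

lower_limit, upper_limit = iqr_limits(prices)

# Values beyond the limits are the outliers
low_outliers = prices[prices < lower_limit]
high_outliers = prices[prices > upper_limit]

print("Lower limit:", lower_limit, "-> low outliers:", len(low_outliers))
print("Upper limit:", upper_limit, "-> high outliers:", len(high_outliers))
```

With this toy data the lower limit is negative and no value falls below it, while one value lies above the upper limit.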
From Wikipedia:
The upper whisker boundary of the box-plot is the largest data value that is within 1.5 IQR above the third quartile.
Similarly, the lower whisker boundary of the box plot is the smallest data value that is within 1.5 IQR below the first quartile.
So in your case the upper whisker is the largest data value that is below 649,000.
And your lower whisker is your minimum: 2,500.
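If you want the whiskers themselves rather than the limits, one possible approach is to take the smallest and largest data values that fall inside the limits, following the Wikipedia definition quoted above. This is only a sketch: the `whiskers` function name and the sample `prices` are assumptions for illustration, not part of your code or data.

```python
import pandas as pd

def whiskers(series):
    """Return (lower_whisker, upper_whisker) as actual data values."""
    q1 = series.quantile(0.25)
    q3 = series.quantile(0.75)
    iqr = q3 - q1
    lower_limit = q1 - 1.5 * iqr
    upper_limit = q3 + 1.5 * iqr
    # Keep only the values that lie within the limits
    inside = series[(series >= lower_limit) & (series <= upper_limit)]
    # The whiskers are the extreme values among those that remain
    return inside.min(), inside.max()

# Hypothetical prices, NOT the real dataset from the question
prices = pd.Series(
    [2500, 150000, 199000, 250000, 275000, 300000, 379000, 400000, 950000]
)

lower_whisker, upper_whisker = whiskers(prices)
print("Lower whisker:", lower_whisker)
print("Upper whisker:", upper_whisker)
```

On your data you would call it as whiskers(data["price"]); the lower whisker would then come out as the minimum, since no value lies below the lower limit.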