Remove outliers

I'm working on a housing dataset. Below is my CSV file format:

BHK  Location  Price
1    A         10
1    A         100
2    B         50
3    C         80
4    A         100
1    C         500

In some cases, within a particular location, houses with fewer BHK cost more than houses with more BHK. This is obviously an error, and I want to remove such entries from my dataset. Any help would be appreciated.

CodePudding user response：

This is less of a Python question and more of a statistics one.

Identifying outliers is tricky. One approach is to fit a linear regression of Price ~ BHK + Location and use Cook's distance to estimate the influence of each observation. Observations with high influence can be labelled as outliers and excluded.
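A minimal sketch of the Cook's distance idea, written with plain NumPy so the formula is visible. Two assumptions that are not in the answer: Location is left out of the regression (in the six-row sample Location B has a single row, so a dummy for it would have leverage 1 and the distance would be undefined), and the 4/n cutoff is just a common rule of thumb. statsmodels should give the same quantity through its OLSInfluence class if you prefer a library call.

```python
import numpy as np
import pandas as pd

# Sample rows from the question
df = pd.DataFrame({
    "BHK": [1, 1, 2, 3, 4, 1],
    "Location": ["A", "A", "B", "C", "A", "C"],
    "Price": [10, 100, 50, 80, 100, 500],
})

# Design matrix: intercept + BHK (Location omitted, see the note above)
X = np.column_stack([np.ones(len(df)), df["BHK"].to_numpy(dtype=float)])
y = df["Price"].to_numpy(dtype=float)
n, p = X.shape

# Ordinary least squares fit and residuals
beta, *_ = np.linalg.lstsq(X, y, rcond=None)
resid = y - X @ beta

# Leverage h_i: diagonal of the hat matrix X (X'X)^-1 X'
leverage = np.einsum("ij,jk,ik->i", X, np.linalg.inv(X.T @ X), X)

# Cook's distance: D_i = e_i^2 / (p * MSE) * h_i / (1 - h_i)^2
mse = resid @ resid / (n - p)
cooks_d = resid**2 / (p * mse) * leverage / (1 - leverage) ** 2

# Assumed rule of thumb: treat D_i > 4 / n as high influence
df_cook = df[cooks_d <= 4 / n]
```

On this sample only the 1 BHK row priced at 500 exceeds the cutoff. With real data you would want the Location term back in and a cutoff chosen after looking at the distribution of distances.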

If you want something much simpler, I would set a min/max price per BHK and use .query() or similar to remove the observations that fall outside that range.
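For the simpler route, something like the following could work. The `bounds` table is entirely made up for illustration; you would pick the low/high values from domain knowledge or from the data itself.

```python
import pandas as pd

# Sample rows from the question
df = pd.DataFrame({
    "BHK": [1, 1, 2, 3, 4, 1],
    "Location": ["A", "A", "B", "C", "A", "C"],
    "Price": [10, 100, 50, 80, 100, 500],
})

# Hypothetical plausible price range for each BHK value (illustrative numbers)
bounds = pd.DataFrame({
    "BHK": [1, 2, 3, 4],
    "low": [5, 30, 60, 80],
    "high": [60, 90, 120, 200],
})

# Attach the range to every row, then keep only rows inside it
merged = df.merge(bounds, on="BHK", how="left")
df_range = merged.query("Price >= low and Price <= high")[df.columns]
```

Rows whose BHK has no entry in `bounds` get NaN limits and are dropped by the query, so make sure the table covers every BHK value you want to keep.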

CodePudding user response：

You can compute the minimum Price per BHK group and flag any row whose Price exceeds the minimum of the next group (BHK + 1):

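# minimum Price for each BHK value (Location is not taken into account)
min_bhk = df.groupby('BHK')['Price'].min()

# flag rows priced above the cheapest house with one more BHK;
# if there is no BHK + 1 group, map() gives NaN and the row is not flagged
outliers = df['Price'].gt(df['BHK'].add(1).map(min_bhk))

# keep only the rows that were not flagged
df2 = df[~outliers]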

Output:

   BHK Location  Price
0    1        A     10
2    2        B     50
3    3        C     80
4    4        A    100
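The filter above compares BHK groups over the whole dataset. If the comparison should stay within a location, as the question words it, one possible sketch is a self-merge on Location. This is my reading of the requirement, not something the answer states, and the helper names (`pairs`, `bad`, `df_loc`) are made up here.

```python
import pandas as pd

# Sample rows from the question
df = pd.DataFrame({
    "BHK": [1, 1, 2, 3, 4, 1],
    "Location": ["A", "A", "B", "C", "A", "C"],
    "Price": [10, 100, 50, 80, 100, 500],
})

# Pair every row with every row from the same Location;
# reset_index() keeps the original row label in an "index" column
pairs = df.reset_index().merge(df, on="Location", suffixes=("", "_other"))

# A row is suspicious if a house with more BHK in the same Location is strictly cheaper
bad = pairs[(pairs["BHK_other"] > pairs["BHK"]) & (pairs["Price"] > pairs["Price_other"])]

# Drop the suspicious rows from the original frame
df_loc = df.drop(index=bad["index"].unique())
```

On the sample this drops only the 1 BHK row in C priced at 500. The 1 BHK row in A priced at 100 stays, because the 4 BHK house in A costs the same rather than less. The self-merge grows quadratically with the number of rows per location, so it may need a grouped alternative on large data.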

NB. In real-life data, min/max are probably not the best indicators; you might want to use the 1%/99% or 5%/95% quantiles instead.
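As a rough sketch of the quantile variant, the only change is swapping min() for quantile(). The 0.05 level is just one of the options mentioned, not a recommendation.

```python
import pandas as pd

# Sample rows from the question
df = pd.DataFrame({
    "BHK": [1, 1, 2, 3, 4, 1],
    "Location": ["A", "A", "B", "C", "A", "C"],
    "Price": [10, 100, 50, 80, 100, 500],
})

# 5% quantile of Price per BHK instead of the strict minimum
low_q = df.groupby("BHK")["Price"].quantile(0.05)

# Flag rows priced above the low quantile of the next BHK group
outliers_q = df["Price"].gt(df["BHK"].add(1).map(low_q))

df_q = df[~outliers_q]
```

On the six-row sample this keeps the same rows as the min-based version, because the BHK 2, 3 and 4 groups each contain a single row. The difference only shows up with larger groups, where a single unusually cheap house no longer sets the threshold.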