Remove outliers

Time:03-24

I'm working on a housing dataset. Below is my CSV file format:

BHK  Location  Price
1    A         10
1    A         100
2    B         50
3    C         80
4    A         100
1    C         500

In some cases, for a particular location, houses with fewer BHK (bedrooms) cost more than houses with more BHK. This is obviously an error. I want to remove entries like these from my dataset. Any help would be appreciated.

CodePudding user response:

This is less of a Python question and more of a statistics one.

Identifying outliers is tricky. One approach is to fit a linear regression of Price ~ BHK + Location and use Cook's distance to estimate the influence of each observation. Observations with high influence can then be labelled as outliers and excluded.
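A minimal sketch of the influence-based approach, using only NumPy and pandas on the sample rows from the question. Two things here are assumptions rather than part of the answer: the model is simplified to Price ~ BHK (with six rows, a location that appears only once would get a leverage of exactly 1, which leaves Cook's distance undefined), and the 4/n cutoff is a common rule of thumb, not a fixed standard.

```python
import numpy as np
import pandas as pd

# Sample rows from the question
df = pd.DataFrame({
    "BHK": [1, 1, 2, 3, 4, 1],
    "Location": ["A", "A", "B", "C", "A", "C"],
    "Price": [10, 100, 50, 80, 100, 500],
})

# Design matrix for the simplified model Price ~ BHK (intercept + slope)
X = np.column_stack([np.ones(len(df)), df["BHK"].to_numpy(dtype=float)])
y = df["Price"].to_numpy(dtype=float)

# Ordinary least squares fit and residuals
beta, *_ = np.linalg.lstsq(X, y, rcond=None)
resid = y - X @ beta

n, p = X.shape
mse = resid @ resid / (n - p)

# Leverage of each row: diagonal of the hat matrix X (X'X)^-1 X'
leverage = np.diag(X @ np.linalg.inv(X.T @ X) @ X.T)

# Cook's distance for each observation
cooks_d = resid**2 / (p * mse) * leverage / (1 - leverage) ** 2

# Assumed rule-of-thumb cutoff: flag rows whose distance exceeds 4/n
threshold = 4 / n
df_clean = df[cooks_d <= threshold]
print(df_clean)
```

On this sample only the last row (1 BHK, location C, price 500) exceeds the cutoff.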

If you want something much simpler, I would set a minimum and maximum plausible price per BHK and use .query() or similar to remove the observations that fall outside that range.
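The simpler option could look like the sketch below. The bounds in price_bounds are hypothetical values chosen for illustration; in practice they would come from domain knowledge or from the data itself.

```python
import pandas as pd

# Sample rows from the question
df = pd.DataFrame({
    "BHK": [1, 1, 2, 3, 4, 1],
    "Location": ["A", "A", "B", "C", "A", "C"],
    "Price": [10, 100, 50, 80, 100, 500],
})

# Hypothetical (min, max) plausible price per BHK count; not from the question
price_bounds = {1: (5, 120), 2: (30, 200), 3: (60, 300), 4: (80, 400)}

# Look up the bounds that apply to each row
lo = df["BHK"].map(lambda b: price_bounds[b][0])
hi = df["BHK"].map(lambda b: price_bounds[b][1])

# Keep only rows whose price lies inside the bounds for their BHK
df_clean = (
    df.assign(lo=lo, hi=hi)
      .query("Price >= lo and Price <= hi")
      .drop(columns=["lo", "hi"])
)
print(df_clean)
```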

CodePudding user response:

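You can compute the minimum Price per BHK group (across all locations) and drop any row that costs more than the minimum of the next BHK group:

# Cheapest price observed for each BHK count
min_bhk = df.groupby('BHK')['Price'].min()
# Flag rows priced above the cheapest home with one more BHK
outliers = df['Price'].gt(df['BHK'].add(1).map(min_bhk))
df2 = df[~outliers]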

output:

   BHK Location  Price
0    1        A     10
2    2        B     50
3    3        C     80
4    4        A    100
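The filter above compares prices across all locations, while the question asks about inconsistencies within a location. A possible location-aware variant, offered as a sketch rather than as part of the original answer, flags a row when it costs more than the cheapest home with more BHK in the same location:

```python
import pandas as pd

# Sample rows from the question
df = pd.DataFrame({
    "BHK": [1, 1, 2, 3, 4, 1],
    "Location": ["A", "A", "B", "C", "A", "C"],
    "Price": [10, 100, 50, 80, 100, 500],
})

# Pair every row with every other row in the same location
pairs = df.reset_index().merge(df, on="Location", suffixes=("", "_other"))

# Keep only the pairs where the other home has more BHK
pairs = pairs[pairs["BHK_other"] > pairs["BHK"]]

# Cheapest larger home in the same location, keyed by the original row index
cheapest_bigger = pairs.groupby("index")["Price_other"].min()

# Rows with no larger home in their location get NaN, which compares as False
outliers = df["Price"].gt(cheapest_bigger.reindex(df.index))
df_loc = df[~outliers]
print(df_loc)
```

On the sample data this removes only the 1 BHK home in location C priced at 500, because location A has no larger home priced below 100.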

NB. In real-life data, min/max are probably not the best indicators. You might want to use the 1%/99% or 5%/95% quantiles instead.
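For instance, the minimum can be swapped for a low quantile of each BHK group. The 5% level below is one of the values suggested above; the variable names are illustrative.

```python
import pandas as pd

# Sample rows from the question
df = pd.DataFrame({
    "BHK": [1, 1, 2, 3, 4, 1],
    "Location": ["A", "A", "B", "C", "A", "C"],
    "Price": [10, 100, 50, 80, 100, 500],
})

# Low quantile of Price per BHK, used in place of the minimum
q = 0.05
low_price = df.groupby("BHK")["Price"].quantile(q)

# Flag rows priced above the low quantile of the next BHK group
outliers = df["Price"].gt(df["BHK"].add(1).map(low_price))
df2 = df[~outliers]
print(df2)
```

With only a handful of rows per group the quantile sits close to the minimum, so the result on this sample matches the min-based filter; the difference shows up on larger data.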
