I'm working on a housing dataset. Below is my CSV file format:
BHK Location Price
1 A 10
1 A 100
2 B 50
3 C 80
4 A 100
1 C 500
In some cases, for a given location, houses with fewer BHK cost more than houses with more BHK. This is obviously an error, and I want to remove such entries from my dataset. Any help would be appreciated.
CodePudding user response:
This is less of a Python question and more of a statistics one.
Identifying outliers is tricky. One approach might be to fit a linear regression of Price ~ BHK + Location
and use Cook's distance to estimate the influence of each observation. Observations with high influence can be labelled as outliers and excluded.
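As a rough sketch of the Cook's distance idea, the snippet below fits Price ~ BHK by ordinary least squares with NumPy and computes the distances by hand. Two things here are my own assumptions rather than part of the answer: Location is left out of the model (in the sample data location B has a single row, which gives that row a leverage of 1 and makes its distance undefined), and the 4/n cutoff is just a common rule of thumb.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    "BHK": [1, 1, 2, 3, 4, 1],
    "Location": list("AABCAC"),
    "Price": [10, 100, 50, 80, 100, 500],
})

# Design matrix for Price ~ BHK: an intercept column plus BHK.
# Location is deliberately omitted in this sketch (assumption, see above).
X = np.column_stack([np.ones(len(df)), df["BHK"].to_numpy(dtype=float)])
y = df["Price"].to_numpy(dtype=float)
n, p = X.shape

# Ordinary least squares fit and residuals.
beta, *_ = np.linalg.lstsq(X, y, rcond=None)
resid = y - X @ beta

# Leverage: diagonal of the hat matrix H = X (X'X)^-1 X'.
leverage = np.einsum("ij,jk,ik->i", X, np.linalg.inv(X.T @ X), X)

# Cook's distance: D_i = e_i^2 / (p * s^2) * h_i / (1 - h_i)^2
s2 = resid @ resid / (n - p)
cooks_d = resid**2 / (p * s2) * leverage / (1 - leverage) ** 2

# Rule-of-thumb cutoff (assumption): treat D_i > 4 / n as an outlier.
df2 = df[cooks_d <= 4 / n]
```

On the sample data only the 500-priced row exceeds the cutoff, so this is more lenient than a hard per-BHK rule. If statsmodels is available, a fitted OLS result exposes the same quantity through get_influence().cooks_distance.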
If you want something much simpler, I would set a min/max price per BHK and use .query()
or something similar to remove the observations that fall outside that range.
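A minimal sketch of that simpler route could look like the following. The price bounds are made-up values purely for illustration (the answer does not give any), and the column names lo and hi are likewise hypothetical.

```python
import pandas as pd

df = pd.DataFrame({
    "BHK": [1, 1, 2, 3, 4, 1],
    "Location": list("AABCAC"),
    "Price": [10, 100, 50, 80, 100, 500],
})

# Hypothetical acceptable price range per BHK (illustrative numbers only).
bounds = pd.DataFrame({
    "BHK": [1, 2, 3, 4],
    "lo": [5, 30, 60, 80],
    "hi": [60, 90, 120, 200],
})

# Attach the bounds to every row, then keep rows whose price is in range.
checked = df.merge(bounds, on="BHK", how="left")
kept = checked.query("Price >= lo and Price <= hi")[["BHK", "Location", "Price"]]
```

In practice the bounds would have to come from domain knowledge or from the data itself, for example per-BHK quantiles.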
CodePudding user response:
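You can compute the minimum Price per BHK group and drop rows priced above the minimum of the next group (BHK + 1):
# Cheapest price observed for each BHK value
min_bhk = df.groupby('BHK')['Price'].min()
# Flag rows priced above the cheapest home with one more BHK (NaN compares as False)
outliers = df['Price'].gt(df['BHK'].add(1).map(min_bhk))
# Keep only the rows that were not flagged
df2 = df[~outliers]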
output:
   BHK Location  Price
0    1        A     10
2    2        B     50
3    3        C     80
4    4        A    100
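Note that the snippet above compares prices across all locations, while the question asks about houses within the same location. A possible per-location variant is sketched below. It is my own adaptation, not part of the answer: each row is compared with the cheapest home that has strictly more BHK in the same location, since a location may not have a BHK + 1 group at all.

```python
import pandas as pd

df = pd.DataFrame({
    "BHK": [1, 1, 2, 3, 4, 1],
    "Location": list("AABCAC"),
    "Price": [10, 100, 50, 80, 100, 500],
})

# Cheapest price per (Location, BHK), with BHK ascending inside each location.
mins = df.groupby(["Location", "BHK"])["Price"].min().sort_index()

# Walk each location from high BHK to low BHK: the running minimum, shifted by
# one step, is the cheapest price among strictly larger BHK values there.
rev = mins.iloc[::-1]
ceiling = (
    rev.groupby(level="Location").cummin()
       .groupby(level="Location").shift(1)
       .rename("ceiling")
)

# Attach the ceiling to every row; NaN means no larger BHK in that location.
checked = df.merge(ceiling.reset_index(), on=["Location", "BHK"], how="left")
outliers = checked["Price"] > checked["ceiling"]
df2 = checked[~outliers].drop(columns="ceiling")
```

With the sample data this drops only the (1, C, 500) row. The (1, A, 100) row stays because the only larger home in location A also costs 100 and the comparison is strict; use >= instead if ties should be removed as well.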
NB. In real-life data, min/max are probably not the best indicators; you might want to use the 1%/99% or 5%/95% quantiles instead.
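For instance, swapping the minimum for a low quantile might look like this. The 5% level is only one of the options mentioned above, and the name low_q is just illustrative.

```python
import pandas as pd

df = pd.DataFrame({
    "BHK": [1, 1, 2, 3, 4, 1],
    "Location": list("AABCAC"),
    "Price": [10, 100, 50, 80, 100, 500],
})

# 5% price quantile per BHK instead of the raw minimum, so that a single
# unusually cheap listing does not set the threshold on its own.
low_q = df.groupby("BHK")["Price"].quantile(0.05)

# Same comparison as before: flag rows priced above the low quantile of BHK + 1.
outliers = df["Price"].gt(df["BHK"].add(1).map(low_q))
df2 = df[~outliers]
```

On a dataset this small the quantile barely differs from the minimum, so the result matches the min-based version; the difference should only show up with many rows per BHK.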