I'd like to delete some outliers from my dataframe.
Product | Brand | Year | calcium_100g | phosphorus_100g | iron_100g | magnesium_100g |
---|---|---|---|---|---|---|
Poduct A | Brand A | 2020 | 8 | 50 | NaN | NaN |
Poduct B | Brand A | 2021 | 54 | -1 | NaN | 17 |
Poduct C | Brand C | 2020 | NaN | NaN | NaN | NaN |
Poduct D | Brand C | 2018 | NaN | 50 | 80 | NaN |
Poduct E | Brand E | 2019 | 123 | 50 | NaN | 27 |
The outliers I'd like to delete are values greater than 100 or below 0 in the columns ending with "_100g" (-1 and 123 in this case).
I found a way to filter the columns ending with "_100g":
Columns100g = list(data.filter(like='_100g', axis = 1).columns)
But at this point I can't find a way to delete my outliers.
CodePudding user response:
Pandas has a function named drop() that you can use alongside the filter code you wrote.
# Deleting columns
# Delete the "_100" column from the dataframe
data = data.drop("_100", axis=1)
# alternatively, delete columns using the columns parameter of drop
data = data.drop(columns="_100")
# Delete the _100 column from the dataframe in place
# Note that the original 'data' object is changed when inplace=True
data.drop("_100", axis=1, inplace=True).
# Delete multiple columns from the dataframe
data = data.drop(["Y2001", "Y2002", "Y2003"], axis=1)
If you replace the column names I wrote ("_100") with the columns from your filter, you are good to go. Also, you can read more about the drop function in the link below: https://www.shanelynn.ie/using-pandas-dataframe-creating-editing-viewing-data-in-python/
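Since the question is about removing rows rather than columns, here is a minimal sketch of how you might combine your filter with drop() to remove rows by index. This is only an assumption of how to wire it up, reusing the data dataframe and the Columns100g list from the question:
# Columns ending with "_100g", as filtered in the question
Columns100g = list(data.filter(like='_100g', axis=1).columns)
# Rows where any "_100g" value is below 0 or above 100
outlier_rows = (data[Columns100g].lt(0) | data[Columns100g].gt(100)).any(axis=1)
# drop() also accepts row labels, so pass the index of the outlier rows
data = data.drop(data[outlier_rows].index)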
CodePudding user response:
I would suggest using the drop() method, which accepts an index, to remove the rows, and using lt(0) and gt(100) to find the indexes to be removed, combined with | (or) and any(axis=1), which returns True when any of the selected columns satisfies either condition:
# Columns that have '_100g'
c = df.filter(like='_100g').columns

# Drop the rows above / below your thresholds
new = df.drop(df[(df[c].lt(0) | df[c].gt(100)).any(axis=1)].index)
Prints back:
print(new)
Product Brand Year ... phosphorus_100g iron_100g magnesium_100g
0 Poduct A Brand A 2020 ... 50.0 NaN NaN
2 Poduct C Brand C 2020 ... NaN NaN NaN
3 Poduct D Brand C 2018 ... 50.0 80.0 NaN
[3 rows x 7 columns]
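For comparison, a minimal sketch of the equivalent formulation that keeps the valid rows instead of dropping the invalid ones (assuming the same df and the same 0-100 thresholds) would be:
# Keep only rows where no '_100g' value falls outside [0, 100]
c = df.filter(like='_100g').columns
mask = (df[c].lt(0) | df[c].gt(100)).any(axis=1)
new = df[~mask]
Comparisons against NaN evaluate to False, so rows that are entirely NaN in those columns are kept, matching the output above.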
CodePudding user response:
Using your Columns100g, you can use a for-loop to filter multiple columns:
for col in Columns100g:
    data = data[(data[col].fillna(0) >= 0) & (data[col].fillna(0) <= 100)]
Edit: But if you want to change outlier values to NaN, you can simply do:
import numpy as np

data[(data[Columns100g] < 0) | (data[Columns100g] > 100)] = np.nan
Output:
Product Brand Year calcium_100g phosphorus_100g iron_100g magnesium_100g
0 Poduct A Brand A 2020 8.0 50.0 NaN NaN
1 Poduct B Brand A 2021 54.0 NaN NaN 17.0
2 Poduct C Brand C 2020 NaN NaN NaN NaN
3 Poduct D Brand C 2018 NaN 50.0 80.0 NaN
4 Poduct E Brand E 2019 NaN 50.0 NaN 27.0
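As a side note, the same NaN replacement can also be written with DataFrame.mask, which replaces values with NaN wherever the condition holds and leaves the other columns untouched (a sketch only, assuming the same Columns100g list):
# Replace out-of-range values with NaN, restricted to the "_100g" columns
data[Columns100g] = data[Columns100g].mask((data[Columns100g] < 0) | (data[Columns100g] > 100))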