I'd like to delete some outliers from my dataframe.
Product | Brand | Year | calcium_100g | phosphorus_100g | iron_100g | magnesium_100g |
---|---|---|---|---|---|---|
Poduct A | Brand A | 2020 | 8 | 50 | NaN | NaN |
Poduct B | Brand A | 2021 | 54 | -1 | NaN | 17 |
Poduct C | Brand C | 2020 | NaN | NaN | NaN | NaN |
Poduct D | Brand C | 2018 | NaN | 50 | 80 | NaN |
Poduct E | Brand E | 2019 | 123 | 50 | NaN | 27 |
The outliers I'd like to delete are values greater than 100 or below 0 in the columns ending with "_100g" (-1 and 123 in this case).
I found a way to filter the columns ending with "_100g":
Columns100g = list(data.filter(like='_100g', axis = 1).columns)
But at this point I can't find a way to delete my outliers.
CodePudding user response:
Pandas has a function named drop() that you can use alongside the filter code you wrote.
# Deleting columns
# Delete the "_100" column from the dataframe
data = data.drop("_100", axis=1)
# alternatively, delete columns using the columns parameter of drop
data = data.drop(columns="_100")
# Delete the _100 column from the dataframe in place
# Note that the original 'data' object is changed when inplace=True
data.drop("_100", axis=1, inplace=True).
# Delete multiple columns from the dataframe
data = data.drop(["Y2001", "Y2002", "Y2003"], axis=1)
If you replace the column names I wrote ("_100") with the columns from your filter, you are good to go. Also, you can read more about the drop function in the link below: https://www.shanelynn.ie/using-pandas-dataframe-creating-editing-viewing-data-in-python/
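Since the question is about removing rows rather than columns, here is a minimal sketch of how you might combine your filter with drop() to remove rows by index. This is only an assumption of how to wire it up, reusing the data dataframe and the Columns100g list from the question:
# Columns ending with "_100g", as filtered in the question
Columns100g = list(data.filter(like='_100g', axis=1).columns)
# Rows where any "_100g" value is below 0 or above 100
outlier_rows = (data[Columns100g].lt(0) | data[Columns100g].gt(100)).any(axis=1)
# drop() also accepts row labels, so pass the index of the outlier rows
data = data.drop(data[outlier_rows].index)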
CodePudding user response:
I would suggest using the drop() method, which accepts an index, to remove the rows, and using lt(0) and gt(100) to find the indexes to be removed, combined with | (or) and any(axis=1), which returns True when any of the selected columns satisfies either condition:
# Columns that have '_100g'
c = df.filter(like='_100g').columns

# Drop the rows above / below your thresholds
new = df.drop(df[(df[c].lt(0) | df[c].gt(100)).any(axis=1)].index)
Prints back:
print(new)
Product Brand Year ... phosphorus_100g iron_100g magnesium_100g
0 Poduct A Brand A 2020 ... 50.0 NaN NaN
2 Poduct C Brand C 2020 ... NaN NaN NaN
3 Poduct D Brand C 2018 ... 50.0 80.0 NaN
[3 rows x 7 columns]
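For comparison, a minimal sketch of the equivalent formulation that keeps the valid rows instead of dropping the invalid ones (assuming the same df and the same 0-100 thresholds) would be:
# Keep only rows where no '_100g' value falls outside [0, 100]
c = df.filter(like='_100g').columns
mask = (df[c].lt(0) | df[c].gt(100)).any(axis=1)
new = df[~mask]
Comparisons against NaN evaluate to False, so rows that are entirely NaN in those columns are kept, matching the output above.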
CodePudding user response:
Using your Columns100g, you can use a for-loop to filter multiple columns:
for col in Columns100g:
    data = data[(data[col].fillna(0) >= 0) & (data[col].fillna(0) <= 100)]
Edit: But if you want to change outlier values to NaN, you can simply do:
import numpy as np

data[(data[Columns100g] < 0) | (data[Columns100g] > 100)] = np.nan
Output:
Product Brand Year calcium_100g phosphorus_100g iron_100g magnesium_100g
0 Poduct A Brand A 2020 8.0 50.0 NaN NaN
1 Poduct B Brand A 2021 54.0 NaN NaN 17.0
2 Poduct C Brand C 2020 NaN NaN NaN NaN
3 Poduct D Brand C 2018 NaN 50.0 80.0 NaN
4 Poduct E Brand E 2019 NaN 50.0 NaN 27.0
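As a side note, the same NaN replacement can also be written with DataFrame.mask, which replaces values with NaN wherever the condition holds and leaves the other columns untouched (a sketch only, assuming the same Columns100g list):
# Replace out-of-range values with NaN, restricted to the "_100g" columns
data[Columns100g] = data[Columns100g].mask((data[Columns100g] < 0) | (data[Columns100g] > 100))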