Home > other >  Python Pandas Removing outliers vs Nan outliers
Python Pandas Removing outliers vs Nan outliers

Time:06-22

I'm trying to understand why this happens in the data frame

import pandas as pd 
import numpy as np
#from pyspark.sql import SparkSession

#spark = SparkSession.builder.getOrCreate()

df = pd.DataFrame({"calories": [400, 200, 220, 70000, 500, 200, 300, 200, 100, 100, 100, 200, 300, 100, 200, 300, 400, 500, 100]})

q_low = df["calories"].quantile(0.01)
q_hi = df["calories"].quantile(0.99)
lb = df.quantile(0.01)
ub = df.quantile(0.99)

#replaces outliers with nan
df_filtered = df[(df < ub) & (df > lb)]

#removes outliers
df_filtered = df[(df["calories"] < q_hi) & (df["calories"] > q_low)]
print(df_filtered)

The first df_filtered will set all outliers to NaN, while the second df_filtered will remove all outliers. What is the functional difference between these two operations? Why does the first one set the outliers to NaN, and the second one just remove them?

CodePudding user response:

This is a really subtle difference and an interesting post! Bravo.

The reason these come out differently is that in the line

df_filtered = df[(df < ub) & (df > lb)]

you are actually attempting to subset by a DataFrame of boolean values. It's a single-column DataFrame, not a Series. That is,

type((df < ub) & (df > lb))

gives pandas.core.frame.DataFrame.

In the second case,

df_filtered = df[(df["calories"] < q_hi) & (df["calories"] > q_low)]

you are subsetting by a boolean Series which we can again check by calling

type((df["calories"] < q_hi) & (df["calories"] > q_low))

which gives pandas.core.series.Series.

When you a DataFrame by a boolean Series, you remove rows corresponding the False values in the Series. When you attempt to subset by a boolean DataFrame, you just cause any elements corresponding to False values to become NaNs.

This behavior makes sense when you start to look at DataFrames with more than column. Consider the toy DataFrame, df:

   calories  calories
0     400.0     401.0
1     200.0     201.0
2     220.0     221.0

We could compare this whole DataFrame to a single value, but it wouldn't make sense to remove either a whole column or row because one value didn't fit our condition. So any values that don't meet the conditions get set to NaN, as in

df[df < 401]

giving

   calories  calories
0     400.0       NaN
1     200.0     201.0
2     220.0     221.0
  • Related