I'm trying to understand why this happens with the following DataFrame:
import pandas as pd
import numpy as np
df = pd.DataFrame({"calories": [400, 200, 220, 70000, 500, 200, 300, 200, 100, 100, 100, 200, 300, 100, 200, 300, 400, 500, 100]})
q_low = df["calories"].quantile(0.01)
q_hi = df["calories"].quantile(0.99)
lb = df.quantile(0.01)
ub = df.quantile(0.99)
# replaces outliers with NaN
df_filtered = df[(df < ub) & (df > lb)]
# removes outliers
df_filtered = df[(df["calories"] < q_hi) & (df["calories"] > q_low)]
print(df_filtered)
The first df_filtered will set all outliers to NaN, while the second df_filtered will remove all outliers. What is the functional difference between these two operations? Why does the first one set the outliers to NaN, and the second one just remove them?
CodePudding user response:
This is a really subtle difference and an interesting post! Bravo.
The reason these come out differently is that in the line
df_filtered = df[(df < ub) & (df > lb)]
you are actually attempting to subset by a DataFrame of boolean values. It's a single-column DataFrame, not a Series. That is,
type((df < ub) & (df > lb))
gives pandas.core.frame.DataFrame.
In the second case,
df_filtered = df[(df["calories"] < q_hi) & (df["calories"] > q_low)]
you are subsetting by a boolean Series which we can again check by calling
type((df["calories"] < q_hi) & (df["calories"] > q_low))
which gives pandas.core.series.Series.
When you subset a DataFrame by a boolean Series, you remove the rows corresponding to False values in the Series. When you subset by a boolean DataFrame, no rows are removed; instead, any elements corresponding to False values become NaN, and the shape of the DataFrame is preserved.
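A minimal sketch of the two behaviors side by side (the values here are arbitrary, chosen just to make one row an obvious outlier):

```python
import pandas as pd

df = pd.DataFrame({"calories": [400, 200, 70000]})

# Boolean Series: rows where the mask is False are dropped entirely.
series_mask = df["calories"] < 1000
print(df[series_mask])   # the 70000 row is gone; 2 rows remain

# Boolean DataFrame: matching cells become NaN, shape is preserved.
frame_mask = df < 1000
print(df[frame_mask])    # still 3 rows, but 70000 is now NaN
```

Note that after masking with the boolean DataFrame, the column is upcast to float so that it can hold NaN.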
This behavior makes sense once you look at DataFrames with more than one column. Consider a toy DataFrame df with two columns (the second named calories2 here so the columns are distinguishable):

   calories  calories2
0     400.0      401.0
1     200.0      201.0
2     220.0      221.0

We could compare this whole DataFrame to a single value, but it wouldn't make sense to remove a whole row or column just because one value failed the condition. So any values that don't meet the condition are set to NaN instead, as in

df[df < 401]

giving

   calories  calories2
0     400.0        NaN
1     200.0      201.0
2     220.0      221.0
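If you do mask with the boolean-DataFrame form but actually want the offending rows removed, one option (a sketch, not the only way) is to chain .dropna() onto the result, which drops every row that contains at least one NaN:

```python
import pandas as pd

df = pd.DataFrame({"calories": [400, 200, 220, 70000, 500]})

lb = df.quantile(0.01)
ub = df.quantile(0.99)

# Mask out-of-range values to NaN, then drop any row containing a NaN.
filtered = df[(df < ub) & (df > lb)].dropna()
print(filtered)
```

With several columns this is stricter than filtering on one column: a row is dropped if *any* of its columns falls outside that column's quantile bounds.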