I'm deciding how to handle outliers in a dataset with over 300 features. I'd like to analyse the frame without removing data hastily. I have a frame:
| | A | B | C | D | E |
|---:|----:|----:|-----:|----:|----:|
| 0 | 100 | 99 | 1000 | 300 | 250 |
| 1 | 665 | 6 | 9 | 1 | 9 |
| 2 | 7 | 665 | 4 | 9 | 1 |
| 3 | 1 | 3 | 4 | 3 | 6 |
| 4 | 1 | 9 | 1 | 665 | 5 |
| 5 | 3 | 4 | 6 | 1 | 9 |
| 6 | 5 | 9 | 1 | 3 | 2 |
| 7 | 1 | 665 | 3 | 2 | 3 |
| 8 | 2 | 665 | 9 | 1 | 0 |
| 9 | 5 | 0 | 7 | 6 | 5 |
| 10 | 0 | 3 | 3 | 7 | 3 |
| 11 | 6 | 3 | 0 | 3 | 6 |
| 12 | 6 | 6 | 5 | 1 | 5 |
I have coded some introspection to be saved in another frame called _outliers:
Q1 = df.quantile(0.25)
Q3 = df.quantile(0.75)
IQR = (Q3 - Q1)
min_ = (Q1 - (1.5 * IQR))
max_ = (Q3 + (1.5 * IQR))
# Counts outliers in columns
_outliers = ((df.le (min_)) | (df.ge (max_))).sum().to_frame(name="outliers")
# Gives each column's share (in percent) of all outliers found across the frame
_outliers["percent"] = (_outliers['outliers'] / _outliers['outliers'].sum()) * 100
# Shows max value in the column
_outliers["max_val"] = df[_outliers.index].max()
# Shows min value in the column
_outliers["min_val"] = df[_outliers.index].min()
# Shows median value in the column
_outliers["median"] = df[_outliers.index].median()
# Shows mean value in the column
_outliers["mean"] = df[_outliers.index].mean()
That yields:
| | outliers | percent | max_val | min_val | median | mean |
|:---|-----------:|----------:|----------:|----------:|---------:|---------:|
| A | 2 | 22.2222 | 665 | 0 | 5 | 61.6923 |
| B | 3 | 33.3333 | 665 | 0 | 6 | 164.385 |
| C | 1 | 11.1111 | 1000 | 0 | 4 | 80.9231 |
| D | 2 | 22.2222 | 665 | 1 | 3 | 77.0769 |
| E | 1 | 11.1111 | 250 | 0 | 5 | 23.3846 |
I would like to measure the impact of the outliers on each column by calculating the mean and the median without them, but I don't want to remove them from the frame to do this calculation. I suppose the best way is to add "~" to the outlier filter, but I get lost in the code. This should benefit a lot of people, as a search on removing outliers yields a lot of results. Apart from the question of why they crept into the data in the first place, I don't think the removal decision should be made without considering the potential impact. Feel free to add other considerations (skewness, sigma, n, etc.)
As always, I'm grateful to this community!
CodePudding user response:
No comment on whether this is an appropriate method to filter out your outliers. The code below should do what you asked:
import numpy as np
import pandas as pd

# Per-column IQR fences
q1, q3 = df.quantile([0.25, 0.75]).to_numpy()
delta = (q3 - q1) * 1.5
min_val, max_val = q1 - delta, q3 + delta
# Boolean frame: True where a value lies outside the fences
outliers = (df < min_val) | (max_val < df)
result = pd.concat([
    outliers.sum().rename("outliers"),
    outliers.sum().rename("percent") / len(df) * 100,
    pd.DataFrame({"max_val": max_val, "min_val": min_val}, index=df.columns),
    df.agg(["median", "mean"]).T,
    # Outliers are masked with NaN, so median and mean skip them
    df.mask(outliers, np.nan).agg(["median", "mean"]).T.add_suffix("_no_outliers")
], axis=1)
Result:
   outliers    percent  max_val  min_val  median        mean  median_no_outliers  mean_no_outliers
A         2  15.384615     13.5     -6.5     5.0   61.692308                 3.0          3.363636
B         3  23.076923    243.0   -141.0     6.0  164.384615                 5.0         14.200000
C         1   7.692308     13.0     -3.0     4.0   80.923077                 4.0          4.333333
D         2  15.384615     16.0     -8.0     3.0   77.076923                 3.0          3.363636
E         1   7.692308     10.5     -1.5     5.0   23.384615                 5.0          4.500000
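Since the question asked about adding "~" to the filter, here is a minimal sketch of that variant. It assumes the frame from the question and the same 1.5 * IQR fences; the names is_outlier, kept and impact are only illustrative. The original frame is left untouched, and the skewness and the number of kept values are added because the question mentioned them.

```python
import pandas as pd

# The frame from the question
df = pd.DataFrame({
    "A": [100, 665, 7, 1, 1, 3, 5, 1, 2, 5, 0, 6, 6],
    "B": [99, 6, 665, 3, 9, 4, 9, 665, 665, 0, 3, 3, 6],
    "C": [1000, 9, 4, 4, 1, 6, 1, 3, 9, 7, 3, 0, 5],
    "D": [300, 1, 9, 3, 665, 1, 3, 2, 1, 6, 7, 3, 1],
    "E": [250, 9, 1, 6, 5, 9, 2, 3, 0, 5, 3, 6, 5],
})

# Per-column IQR fences (Series indexed by column name)
q1, q3 = df.quantile(0.25), df.quantile(0.75)
iqr = q3 - q1
lower, upper = q1 - 1.5 * iqr, q3 + 1.5 * iqr

# True where a value lies outside the fences of its column
is_outlier = (df < lower) | (df > upper)

# where() keeps values where the condition is True and puts NaN elsewhere,
# so "~is_outlier" keeps the non-outliers; df itself is not modified
kept = df.where(~is_outlier)

impact = pd.DataFrame({
    "n_kept": kept.count(),                        # non-outlier values per column
    "mean_shift": df.mean() - kept.mean(),         # how far outliers pull the mean
    "median_shift": df.median() - kept.median(),   # same for the median
    "skew": df.skew(),
    "skew_no_outliers": kept.skew(),
})
print(impact)
```

A large mean_shift next to a small median_shift suggests that the outliers mostly affect the mean of that column.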
CodePudding user response:
Take advantage of the apply method of DataFrame.
Series generator
Just define how you want the robust mean to be computed by writing a function that consumes a Series and returns a scalar, then apply it to your DataFrame.
For the IQR mean, here is a simple snippet:
def irq_agg(x, factor=1.5, aggregate=pd.Series.mean):
    # Keep only the values strictly inside the IQR fences, then aggregate them
    q1, q3 = x.quantile(0.25), x.quantile(0.75)
    return aggregate(x[(q1 - factor*(q3 - q1) < x) & (x < q3 + factor*(q3 - q1))])

data.apply(irq_agg)
# A 3.363636
# B 14.200000
# C 4.333333
# D 3.363636
# E 4.500000
# dtype: float64
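Because aggregate can be any function that reduces a Series to a scalar, the same filter can also report the median or the number of values kept. A small usage sketch, assuming data holds the frame from the question (irq_agg is repeated so the snippet runs on its own):

```python
import pandas as pd

# The frame from the question
data = pd.DataFrame({
    "A": [100, 665, 7, 1, 1, 3, 5, 1, 2, 5, 0, 6, 6],
    "B": [99, 6, 665, 3, 9, 4, 9, 665, 665, 0, 3, 3, 6],
    "C": [1000, 9, 4, 4, 1, 6, 1, 3, 9, 7, 3, 0, 5],
    "D": [300, 1, 9, 3, 665, 1, 3, 2, 1, 6, 7, 3, 1],
    "E": [250, 9, 1, 6, 5, 9, 2, 3, 0, 5, 3, 6, 5],
})

def irq_agg(x, factor=1.5, aggregate=pd.Series.mean):
    # Keep only the values strictly inside the IQR fences, then aggregate them
    q1, q3 = x.quantile(0.25), x.quantile(0.75)
    iqr = q3 - q1
    return aggregate(x[(q1 - factor * iqr < x) & (x < q3 + factor * iqr)])

# Extra keyword arguments given to apply are forwarded to irq_agg
medians = data.apply(irq_agg, aggregate=pd.Series.median)
counts = data.apply(irq_agg, aggregate=pd.Series.count)

print(medians)
print(counts)
```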
The same can be done to filter on percentiles (two-sided version):
def quantile_agg(x, alpha=0.05, aggregate=pd.Series.mean):
    # Keep only the values strictly between the alpha/2 and 1 - alpha/2 quantiles
    return aggregate(x[(x.quantile(alpha/2) < x) & (x < x.quantile(1 - alpha/2))])

data.apply(quantile_agg, alpha=0.01)
# A 12.454545
# B 15.777778
# C 4.727273
# D 41.625000
# E 4.909091
# dtype: float64
Frame generator
Even better, create a function that returns a Series, and apply will create a DataFrame. Then we can compute a bunch of different means and medians at once in order to compare them. We can also reuse the Series generator functions defined above:
def analyze(x, alpha=0.05, factor=1.5):
    return pd.Series({
        "p_mean": quantile_agg(x, alpha=alpha),
        "p_median": quantile_agg(x, alpha=alpha, aggregate=pd.Series.median),
        "irq_mean": irq_agg(x, factor=factor),
        "irq_median": irq_agg(x, factor=factor, aggregate=pd.Series.median),
        "standard": x[((x - x.mean())/x.std()).abs() < 1].mean(),
        "mean": x.mean(),
        "median": x.median(),
    })

data.apply(analyze).T
#       p_mean  p_median   irq_mean  irq_median   standard        mean  median
# A  12.454545       5.0   3.363636         3.0  11.416667   61.692308     5.0
# B  15.777778       6.0  14.200000         5.0  14.200000  164.384615     6.0
# C   4.727273       4.0   4.333333         4.0   4.333333   80.923077     4.0
# D  41.625000       4.5   3.363636         3.0   3.363636   77.076923     3.0
# E   4.909091       5.0   4.500000         5.0   4.500000   23.384615     5.0
Now you can filter out outliers in several ways and compute relevant aggregates, such as the mean or median, on the remaining values.
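The "standard" entry in analyze keeps the values within one standard deviation of the column mean. As a sketch of the sigma-based variant the question mentions, the same filter can be written in the style of the functions above with a configurable cut-off. zscore_agg and its threshold parameter are hypothetical names introduced here, and data is assumed to hold the frame from the question.

```python
import pandas as pd

# The frame from the question
data = pd.DataFrame({
    "A": [100, 665, 7, 1, 1, 3, 5, 1, 2, 5, 0, 6, 6],
    "B": [99, 6, 665, 3, 9, 4, 9, 665, 665, 0, 3, 3, 6],
    "C": [1000, 9, 4, 4, 1, 6, 1, 3, 9, 7, 3, 0, 5],
    "D": [300, 1, 9, 3, 665, 1, 3, 2, 1, 6, 7, 3, 1],
    "E": [250, 9, 1, 6, 5, 9, 2, 3, 0, 5, 3, 6, 5],
})

def zscore_agg(x, threshold=3.0, aggregate=pd.Series.mean):
    # z-scores use the sample standard deviation (pandas default, ddof=1)
    z = (x - x.mean()) / x.std()
    # Keep only the values whose |z| is strictly below the threshold
    return aggregate(x[z.abs() < threshold])

three_sigma = data.apply(zscore_agg)               # default 3-sigma cut-off
one_sigma = data.apply(zscore_agg, threshold=1.0)  # same filter as "standard"

print(three_sigma)
print(one_sigma)
```

Keep in mind that the mean and standard deviation used for the z-scores are themselves pulled by the outliers, so this filter tends to be less sensitive than the IQR fences, especially on a frame this small.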