pandas outliers with and without calculations-CodePudding

I'm contemplating making decisions on outliers on a dataset with over 300 features. I'd like to analyse the frame without removing the data hastingly. I have a frame:

 |    |   A |   B |    C |   D |   E |
 |---:|----:|----:|-----:|----:|----:|
 |  0 | 100 |  99 | 1000 | 300 | 250 |
 |  1 | 665 |   6 |    9 |   1 |   9 |
 |  2 |   7 | 665 |    4 |   9 |   1 |
 |  3 |   1 |   3 |    4 |   3 |   6 |
 |  4 |   1 |   9 |    1 | 665 |   5 |
 |  5 |   3 |   4 |    6 |   1 |   9 |
 |  6 |   5 |   9 |    1 |   3 |   2 |
 |  7 |   1 | 665 |    3 |   2 |   3 |
 |  8 |   2 | 665 |    9 |   1 |   0 |
 |  9 |   5 |   0 |    7 |   6 |   5 |
 | 10 |   0 |   3 |    3 |   7 |   3 |
 | 11 |   6 |   3 |    0 |   3 |   6 |
 | 12 |   6 |   6 |    5 |   1 |   5 |

I have coded some introspection to be saved in another frame called _outliers:

Q1 = df.quantile(0.25)
Q3 = df.quantile(0.75)
IQR = (Q3 - Q1)
min_ = (Q1 - (1.5 * IQR))
max_ = (Q3   (1.5 * IQR))
# Counts outliers in columns
_outliers = ((df.le (min_)) | (df.ge (max_))).sum().to_frame(name="outliers")
# Gives percentage of data that outliers represent in the column
_outliers["percent"] = (_outliers['outliers'] / _outliers['outliers'].sum()) * 100
# Shows max value in the column
_outliers["max_val"] = df[_outliers.index].max()
# Shows min value in the column
_outliers["min_val"] = df[_outliers.index].min()
# Shows median value in the column
_outliers["median"] = df[_outliers.index].median()
# Shows mean value in the column
_outliers["mean"] = df[_outliers.index].mean()

That yields:

|    |   outliers |   percent |   max_val |   min_val |   median |     mean |
|:---|-----------:|----------:|----------:|----------:|---------:|---------:|
| A  |          2 |   22.2222 |       665 |         0 |        5 |  61.6923 |
| B  |          3 |   33.3333 |       665 |         0 |        6 | 164.385  |
| C  |          1 |   11.1111 |      1000 |         0 |        4 |  80.9231 |
| D  |          2 |   22.2222 |       665 |         1 |        3 |  77.0769 |
| E  |          1 |   11.1111 |       250 |         0 |        5 |  23.3846 |

I would like to calculate the impact of the outliers on the column by calculating the mean and the median without them. I don't want to remove them to do this calculation. I suppose the best way is to add "~" to the outlier filter but I get lost in the code... This will benefit a lot of people as a search on removing outliers yields a lot of results. Other than the why they sneaked in the data in the first place, I just don't think the removal decision should be made without consideration on the potential impact. Feel free to add other considerations (skewness, sigma, n, etc.)

As always, I'm grateful to this community!

CodePudding user response：

No comment on whether this is an appropriate method to filter out your outliers. The code below should do what you asked:

q1, q3 = df.quantile([0.25, 0.75]).to_numpy()
delta = (q3 - q1) * 1.5
min_val, max_val = q1 - delta, q3   delta
outliers = (df < min_val) | (max_val < df)

result = pd.concat([
    outliers.sum().rename("outliers"),
    outliers.sum().rename("percent") / len(df) * 100,
    pd.DataFrame({"max_val": max_val, "min_val": min_val}, index=df.columns),
    df.agg(["median", "mean"]).T,
    df.mask(outliers, np.nan).agg(["median", "mean"]).T.add_suffix("_no_outliers")
], axis=1)

Result:

   outliers    percent  max_val  min_val  median        mean  median_no_outliers  mean_no_outliers
A         2  15.384615     13.5     -6.5     5.0   61.692308                 3.0          3.363636
B         3  23.076923    243.0   -141.0     6.0  164.384615                 5.0         14.200000
C         1   7.692308     13.0     -3.0     4.0   80.923077                 4.0          4.333333
D         2  15.384615     16.0     -8.0     3.0   77.076923                 3.0          3.363636
E         1   7.692308     10.5     -1.5     5.0   23.384615                 5.0          4.500000

CodePudding user response：

Take advantage of apply method of DataFrame.

Series genereator

Just define the way you want the robust mean to apply by creating a method consuming Series and returning scalar and apply it to your DataFrame.

For the IRQ mean, here is a simple snippet:

def irq_agg(x, factor=1.5, aggregate=pd.Series.mean):
    q1, q3 = x.quantile(0.25), x.quantile(0.75) 
    return aggregate(x[(q1 - factor*(q3 - q1) < x) & (x < q3   factor*(q3 - q1))])

data.apply(irq_agg)

# A     3.363636
# B    14.200000
# C     4.333333
# D     3.363636
# E     4.500000
# dtype: float64

The same can be done to filter out based on percentiles (both side version):

def quantile_agg(x, alpha=0.05, aggregate=pd.Series.mean):
    return aggregate(x[(x.quantile(alpha/2) < x) & (x < x.quantile(1 - alpha/2))])

data.apply(quantile_agg, alpha=0.01)

# A    12.454545
# B    15.777778
# C     4.727273
# D    41.625000
# E     4.909091
# dtype: float64

Frame generator

Even better, create a function that returns a Series, apply will create a DataFrame. Then we can compute at once a bunch of different means and medians in order to compare them. We can also reuse Series generator method defined above:

def analyze(x, alpha=0.05, factor=1.5):
    return pd.Series({
        "p_mean": quantile_agg(x, alpha=alpha),
        "p_median": quantile_agg(x, alpha=alpha, aggregate=pd.Series.median),
        "irq_mean": irq_agg(x, factor=factor),
        "irq_median": irq_agg(x, factor=factor, aggregate=pd.Series.median),
        "standard": x[((x - x.mean())/x.std()).abs() < 1].mean(),
        "mean": x.mean(),
        "median": x.median(),
    })

data.apply(analyze).T

#       p_mean  p_median   irq_mean  irq_median   standard        mean  median
# A  12.454545       5.0   3.363636         3.0  11.416667   61.692308     5.0
# B  15.777778       6.0  14.200000         5.0  14.200000  164.384615     6.0
# C   4.727273       4.0   4.333333         4.0   4.333333   80.923077     4.0
# D  41.625000       4.5   3.363636         3.0   3.363636   77.076923     3.0
# E   4.909091       5.0   4.500000         5.0   4.500000   23.384615     5.0

Now you can filter out outlier in several ways computes relevant aggregate on it such as mean or median.