How to remove outliers using standard deviation?-CodePudding

I have a set of query results that look a bit like this:

customer|sales_growth_percentage
--------|-----------------------
A       |0.00
B       |0.00
C       |0.00
D       |15.05
E       |20.00
F       |25.10
G       |30.00
H       |500.00
I       |600.20

What I want to do is calculate the following while excluding any statistical outliers in sales growth (e.g. customers H & I):

count(customer),
sum(sales_growth_percentage)

I know I can calculate the standard deviation using the STDDEV function, which for the example above gives me 224.84. Does anyone know how to use this to exclude outliers please? Are outliers literally anything with sales growth higher than 224.84?

I presume I could save the standard deviation as a variable, save my query results into a temp table, and then select from the temp table where sales_growth_percentage <=224.84? Is this the best way to do it, or is there a more effective method?

Any help would be appreciated!

CodePudding user response：

You can pre-compute the average and standard deviation in a CTE, so then you can use it in the main query. For example:

with
stats (av, st) as (
  select avg(sales_growth_percentage), stddev(sales_growth_percentage) from t
)
select *
from t
cross join stats s
where t.sales_growth_percentage between s.av - s.st and s.av   s.st

Result:

 CUSTOMER  SALES_GROWTH_PERCENTAGE  AV                           ST              
 --------- ------------------------ ---------------------------- --------------- 
 A         0.00                     132.26111111111111111111111  224.83562857462 
 B         0.00                     132.26111111111111111111111  224.83562857462 
 C         0.00                     132.26111111111111111111111  224.83562857462 
 D         15.05                    132.26111111111111111111111  224.83562857462 
 E         20.00                    132.26111111111111111111111  224.83562857462 
 F         25.10                    132.26111111111111111111111  224.83562857462 
 G         30.00                    132.26111111111111111111111  224.83562857462

See running example at db<>fiddle.

CodePudding user response：

You may use the corresponding OLAP functions:

/*
WITH MYTAB (customer, sales_growth_percentage) AS
(
VALUES
  ('A',   0.00)
, ('B',   0.00)
, ('C',   0.00)
, ('D',  15.05)
, ('E',  20.00)
, ('F',  25.10)
, ('G',  30.00)
, ('H', 500.00)
, ('I', 600.20)
)
*/
SELECT *
FROM
(
SELECT 
  T.*
, STDDEV (sales_growth_percentage) OVER () AS ST 
, AVG (sales_growth_percentage) OVER ()    AS AV
FROM MYTAB T
)
WHERE sales_growth_percentage BETWEEN AV - ST AND AV   ST;

The result is:

CUSTOMER	SALES_GROWTH_PERCENTAGE	ST	AV
A	0.00	224.83562857461717	132.2611111111111111111111111111
B	0.00	224.83562857461717	132.2611111111111111111111111111
C	0.00	224.83562857461717	132.2611111111111111111111111111
D	15.05	224.83562857461717	132.2611111111111111111111111111
E	20.00	224.83562857461717	132.2611111111111111111111111111
F	25.10	224.83562857461717	132.2611111111111111111111111111
G	30.00	224.83562857461717	132.2611111111111111111111111111