I have a set of query results that look a bit like this:
customer|sales_growth_percentage
--------|-----------------------
A |0.00
B |0.00
C |0.00
D |15.05
E |20.00
F |25.10
G |30.00
H |500.00
I |600.20
What I want to do is calculate the following while excluding any statistical outliers in sales growth (e.g. customers H & I):
count(customer),
sum(sales_growth_percentage)
I know I can calculate the standard deviation using the STDDEV function, which for the example above gives me 224.84. Does anyone know how to use this to exclude outliers please? Are outliers literally anything with sales growth higher than 224.84?
I presume I could save the standard deviation as a variable, save my query results into a temp table, and then select from the temp table where sales_growth_percentage <=224.84? Is this the best way to do it, or is there a more effective method?
Any help would be appreciated!
CodePudding user response:
You can pre-compute the average and standard deviation in a CTE, so then you can use it in the main query. For example:
with
stats (av, st) as (
select avg(sales_growth_percentage), stddev(sales_growth_percentage) from t
)
select *
from t
cross join stats s
where t.sales_growth_percentage between s.av - s.st and s.av s.st
Result:
CUSTOMER SALES_GROWTH_PERCENTAGE AV ST
--------- ------------------------ ---------------------------- ---------------
A 0.00 132.26111111111111111111111 224.83562857462
B 0.00 132.26111111111111111111111 224.83562857462
C 0.00 132.26111111111111111111111 224.83562857462
D 15.05 132.26111111111111111111111 224.83562857462
E 20.00 132.26111111111111111111111 224.83562857462
F 25.10 132.26111111111111111111111 224.83562857462
G 30.00 132.26111111111111111111111 224.83562857462
See running example at db<>fiddle.
CodePudding user response:
You may use the corresponding OLAP functions:
/*
WITH MYTAB (customer, sales_growth_percentage) AS
(
VALUES
('A', 0.00)
, ('B', 0.00)
, ('C', 0.00)
, ('D', 15.05)
, ('E', 20.00)
, ('F', 25.10)
, ('G', 30.00)
, ('H', 500.00)
, ('I', 600.20)
)
*/
SELECT *
FROM
(
SELECT
T.*
, STDDEV (sales_growth_percentage) OVER () AS ST
, AVG (sales_growth_percentage) OVER () AS AV
FROM MYTAB T
)
WHERE sales_growth_percentage BETWEEN AV - ST AND AV ST;
The result is:
CUSTOMER | SALES_GROWTH_PERCENTAGE | ST | AV |
---|---|---|---|
A | 0.00 | 224.83562857461717 | 132.2611111111111111111111111111 |
B | 0.00 | 224.83562857461717 | 132.2611111111111111111111111111 |
C | 0.00 | 224.83562857461717 | 132.2611111111111111111111111111 |
D | 15.05 | 224.83562857461717 | 132.2611111111111111111111111111 |
E | 20.00 | 224.83562857461717 | 132.2611111111111111111111111111 |
F | 25.10 | 224.83562857461717 | 132.2611111111111111111111111111 |
G | 30.00 | 224.83562857461717 | 132.2611111111111111111111111111 |