I made this exercise on SQL server: write a query that lists for each cluster the quantity of products that fall within it. The company wants to obtain an analysis of sales with respect to the average quantity of each product present in each order, classifying them into six clusters: Q1 (<15), Q2 (15-20), Q3 (21-25), Q4 (26-30), Q5 (31-35), Q6(>35). Write a query that lists, for each product, the product name and the cluster to which it belong. The database is northwind
select count(ProductName) as prod_num ,cluster
from (
select ProductName,
case
when avg(Quantity) < 15 then 'Q1'
when avg(Quantity) <= 20 then 'Q2'
when avg(Quantity) between 21 and 25 then 'Q3'
when avg(Quantity) between 26 and 30 then 'Q4'
when avg(Quantity) between 31 and 35 then 'Q5'
else 'Q6'
end
as cluster
from [Order Details] od join Products pr on od.ProductID=pr.ProductID
group by ProductName
) as clusters
group by cluster
order by cluster
OUTPUT
22 Q2
35 Q3
18 Q4
2 Q6
I also need to display values for Q1 and Q5.
CodePudding user response:
You can always seed your initial counts, for example:
declare @clusters table (prod_num int, cluster nchar(2));
insert into @clusters values
(0, 'Q1'),(0, 'Q2'),(0, 'Q3'),(0, 'Q4'),(0, 'Q5'),(0, 'Q6');
select
t1.cluster,
t1.prod_num isnull(t2.prod_num, 0) as prod_num
from
@clusters t1
left join
(
select count(ProductName) as prod_num ,cluster
from (
select ProductName,
case
when avg(Quantity) < 15 then 'Q1'
when avg(Quantity) between 15 and 20 then 'Q2'
when avg(Quantity) between 21 and 25 then 'Q3'
when avg(Quantity) between 26 and 30 then 'Q4'
when avg(Quantity) between 31 and 35 then 'Q5'
else 'Q6'
end
as cluster
from [Order Details] od join Products pr on od.ProductID=pr.ProductID
group by ProductName
) as clusters
group by cluster
) t2
on t1.cluster = t2.cluster
order by t1.cluster;
Now we have an initial count of zero for all groups and add to that the counts we found in our query.
Untested so please let me know if you find errors...
CodePudding user response:
You can make your query much simpler by storing the clusters and the ranges in a table. I use a #temp table here but there's no reason this couldn't be a static, permanent dimension table.
CREATE TABLE #clusters(cluster char(2), lo int, hi int);
INSERT #clusters VALUES('Q1', 0,14),('Q2',15,20),('Q3',21,25),
('Q4',26,30),('Q5',31,35),('Q6',36,2000000000);
SELECT prod_num = COUNT(p.ProductName), cl.cluster
FROM #clusters AS cl
LEFT OUTER JOIN
(
SELECT pr.ProductName, avgQ = AVG(od.Quantity)
FROM dbo.[Order Details] AS od
INNER JOIN dbo.Products AS pr
ON od.ProductID = pr.ProductID
GROUP BY pr.ProductName
) AS p
ON p.avgQ BETWEEN cl.lo AND cl.hi
GROUP BY cl.cluster;
Working example in this fiddle.
And again, unless ProductName
is not unique and you really care about averages across different IDs with the same name as opposed to just product IDs, the join is unnecessary and you could simplify even further:
SELECT prod_num = COUNT(p.ProductID), cl.cluster
FROM #clusters AS cl
LEFT OUTER JOIN
(
SELECT ProductID, avgQ = AVG(Quantity)
FROM dbo.[Order Details]
GROUP BY ProductID
) AS p
ON p.avgQ BETWEEN cl.lo AND cl.hi
GROUP BY cl.cluster;