I'm working through a problem set from CMU's public db systems course. I have the following two tables:
Order
Id | CustomerId | EmployeeId | OrderDate | RequiredDate | ShippedDate | ShipVia | Freight | ShipName | ShipAddress | ShipCity | ShipRegion | ShipPostalCode | ShipCountry |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|
10248 | VINET | 5 | 2012-07-04 | 2012-08-01 | 2012-07-16 | 3 | 16.75 | Vins et alcools Chevalier | 59 rue de l'Abbaye | Reims | Western Europe | 51100 | France |
10249 | TOMSP | 6 | 2012-07-05 | 2012-08-16 | 2012-07-10 | 1 | 22.25 | Toms Spezialitäten | Luisenstr. 48 | Münster | Western Europe | 44087 | Germany |
Shipper
Id | CompanyName | Phone |
---|---|---|
1 | Speedy Express | (503) 555-9831 |
2 | United Package | (503) 555-3199 |
The question I am attempting to answer is the following:
For each Shipper, find the percentage of orders which are late. Details: An order is considered late if ShippedDate > RequiredDate. Print the following format, order by descending percentage, rounded to the nearest hundredths, like
United Package|23.44
.
This is the query I came up with, which produces the expected results:
SELECT
CompanyName,
ROUND(100*SUM(IIF(ShippedDate > RequiredDate, 1, 0))/Cast(Count(ShipName) as Float), 2) as percent
FROM 'Order' AS O
JOIN Shipper AS S
ON O.ShipVia = S.Id
GROUP BY CompanyName
ORDER BY percent DESC;
The official solution is a more involved query as follows:
SELECT CompanyName, round(delayCnt * 100.0 / cnt, 2) AS pct
FROM (
SELECT ShipVia, COUNT(*) AS cnt
FROM 'Order'
GROUP BY ShipVia
) AS totalCnt
INNER JOIN (
SELECT ShipVia, COUNT(*) AS delaycnt
FROM 'Order'
WHERE ShippedDate > RequiredDate
GROUP BY ShipVia
) AS delayCnt
ON totalCnt.ShipVia = delayCnt.ShipVia
INNER JOIN Shipper on totalCnt.ShipVia = Shipper.Id
ORDER BY pct DESC;
The outputs of the two queries match, each returning the following:
Federal Shipping|23.61
Speedy Express|23.46
United Package|23.44
What is the benefit of doing it as shown in the solution rather than using my approach?
CodePudding user response:
The official solution is a very bad solution because it aggregates twice on the same table Order
and then joins the results whilst the same can be achieved by a single aggregation query (like your first query).
I can't believe that this would be proposed as a proper solution for this type of problems.
I hope it's just for practice on joins for subqueries.
Also, I find it very disappointing to see the use of single quotes for table (or column) names/aliases.
This is a very bad practice an it can lead to unexpected results.
When needed use double quotes which is the SQL standard (square brackets or backticks also work in SQLite).
As I said your first query is correct and it could be further simplified with the use of the aggregate function AVG()
:
SELECT S.CompanyName,
ROUND(100*AVG(COALESCE(O.ShippedDate, '') > O.RequiredDate), 2) AS percent
FROM Shipper AS S JOIN "Order" AS O
ON O.ShipVia = S.Id
GROUP BY S.Id
ORDER BY percent DESC;
Note the use of Id
(it is the primary key of Shipper
) in the GROUP BY
clause instead of CompanyName
, which provides better performance.
Also, I used COALESCE()
, because I found (after downloading your database) that in the column ShippedDate
there are null
s.