Sorry if the question doesn't make any sense. I'm practicing on a sample sales data sheet and want to find distribution of shipping modes for each state. The following code finds distribution for shipping modes on entire sheet:
SELECT
ship_mode,
ROUND(CAST(COUNT(ship_mode) * 100 AS float) / (SELECT CAST(COUNT(*) AS float) FROM sales_data_clean), 2) AS [Percentage]
FROM
sales_data_clean
GROUP BY
ship_mode
ORDER BY
ship_mode
Output:
First Class 15.32
Same Day 5.49
Second Class 19.41
Standard Class 59.79
Now I want to find the distribution for each state, so sample output should look something like this (using random percentage values):
State Ship_mode Percentage
----------------------------------
Alabama First Class 15.0
Alabama Same Day 15.0
Alabama Second Class 30.0
Alabama Standard Class 40.0
Arizona First Class 25.0
Arizona Same Day 25.0
Arizona Second Class 25.0
Arizona Standard Class 25.0
whereas each state's ship modes combined should add up to 100. Right now I've tried this code:
SELECT state, ship_mode,
round(cast(count(ship_mode) * 100 as float) /(SELECT cast(count(*) as float) from sales_data_clean), 2) as [Percentage]
FROM sales_data_clean
GROUP BY ship_mode, state
ORDER BY state,ship_mode
Alabama First Class 0.09
Alabama Same Day 0.01
Alabama Second Class 0.18
Alabama Standard Class 0.34
Arizona First Class 0.43
Arizona Same Day 0.16
Arizona Second Class 0.28
Arizona Standard Class 1.41
As you can see, the percentage column reflects each ship modes count divided by grand total of rows, instead of being divided by each state.
Thank you in advance
Bonus - is there a way to optimize this line of code?
ROUND(CAST(COUNT(ship_mode) * 100 AS float) / (SELECT CAST(COUNT(*) AS float) FROM sales_data_clean), 2)
CodePudding user response:
The general calculation isn't too hard. Getting the total to be exactly 100 is trickier. Try something like this:
with data as (
SELECT state, ship_mode,
cast(round(count(*) * 100e / sum(count(*)) over (), 2) as decimal(5, 2)) as pct,
row_number() over (order by count(*), state, ship_mode) as rn1,
row_number() over (order by count(*) desc, state desc, ship_mode desc) as rn2
FROM sales_data_clean
GROUP BY state, ship_mode
)
select state, ship_mode, sum(pct) over (), pct,
case
when sum(pct) over () < 100 and rn2 = 1 then pct 100 - sum(pct) over ()
when sum(pct) over () > 100 and rn1 = 1 then pct - 100 sum(pct) over ()
else pct
end as pct
from data
order by state, ship_mode;
https://dbfiddle.uk/?rdbms=sqlserver_2019&fiddle=d31b17537e46f86f9fe3ea79ccc4c650
The idea of distributing the rounding errors to work out to neat total is not entirely trivial.
CodePudding user response:
You're in the right direction, but I think you want to divide by the SUM of the counts for each state. There are a few ways to obtain that. Here's one:
SELECT State, Ship_mode
, CONVERT(DECIMAL(12,3), (COUNT(*) * 100.0) / SUM(COUNT(*)) OVER (PARTITION BY State)) AS pct
FROM sales_data_clean
GROUP BY state, ship_mode
;
The result:
State | Ship_mode | pct |
---|---|---|
Alabama | First Class | 10.000 |
Alabama | Same Day | 20.000 |
Alabama | Second Class | 30.000 |
Alabama | Standard Class | 40.000 |
Arizona | First Class | 16.667 |
Arizona | Same Day | 16.667 |
Arizona | Second Class | 33.333 |
Arizona | Standard Class | 33.333 |
As already noted, distributing rounding error can be tricky. While this doesn't always total to exactly 100, given just any data, it's fairly close as a first try. With this test data, we happen to be lucky.
Thanks to shawnt00 for the initial fiddle. I've adjusted the data slightly.
CREATE TABLE sales_data_clean (
State VARCHAR(20)
, Ship_mode VARCHAR(20)
);
INSERT INTO sales_data_clean (State, Ship_mode) VALUES
('Alabama', 'First Class')
, ('Alabama', 'Second Class')
, ('Alabama', 'Second Class')
, ('Alabama', 'Second Class')
, ('Alabama', 'Standard Class')
, ('Alabama', 'Standard Class')
, ('Alabama', 'Standard Class')
, ('Alabama', 'Standard Class')
, ('Alabama', 'Same Day')
, ('Alabama', 'Same Day')
, ('Arizona', 'First Class')
, ('Arizona', 'Second Class')
, ('Arizona', 'Second Class')
, ('Arizona', 'Standard Class')
, ('Arizona', 'Standard Class')
, ('Arizona', 'Same Day')
;