I am trying to fetch a hierarchical data in my database. Here is my initial code.
SELECT TOP 5
year, location_state, location_city,
COUNT(tf.customer_key) Number_of_Customers
FROM TransactionFact tf
JOIN LocationDim as ld
ON ld.location_key = tf.seller_location_key
JOIN DateDim dd
ON dd.date_key = tf.order_date_key
WHERE dd.year = 2016 and location_state = 'SP'
GROUP BY dd.year, ld.location_state, ld.location_city
ORDER BY dd.year DESC, Number_of_Customers DESC
And here is the result, result.
Basically, in the query, what I want to do is to not hard code the location_state in the WHERE
clause. I want to make it dynamic so that what I get are the top 5 cities in each state.
Here are the column names for the LocationDim table
location_key
location_zip_code_prefix
location_state
location_city
EDITED: What I need is something like this.
------ ---------------- --------------- ---------------------
| year | location_state | location_city | Number_of_Customers |
------ ---------------- --------------- ---------------------
| 2016 | STATE_1 | city_1 | 100 |
------ ---------------- --------------- ---------------------
| 2016 | STATE_1 | city_2 | 90 |
------ ---------------- --------------- ---------------------
| 2016 | STATE_1 | city_3 | 89 |
------ ---------------- --------------- ---------------------
| 2016 | STATE_1 | city_4 | 88 |
------ ---------------- --------------- ---------------------
| 2016 | STATE_1 | city_5 | 20 |
------ ---------------- --------------- ---------------------
| 2016 | STATE_2 | city_1 | 100 |
------ ---------------- --------------- ---------------------
| 2016 | STATE_2 | city_2 | 45 |
------ ---------------- --------------- ---------------------
| 2016 | STATE_2 | city_3 | 23 |
------ ---------------- --------------- ---------------------
| 2016 | STATE_2 | city_4 | 10 |
------ ---------------- --------------- ---------------------
| 2016 | STATE_2 | city_5 | 5 |
------ ---------------- --------------- ---------------------
PS: Sorry, this is my first question in stackoverflow. If this question is duplicated, pls drop the link and I'll give it a go. Thank u in advance.
CodePudding user response:
If you calculate a row_number then you can filter on that.
SELECT *
FROM
(
SELECT
dd.[year]
, ld.location_state
, ld.location_city
, COUNT(tf.customer_key) AS total_customers
, rn = ROW_NUMBER() OVER (PARTITION BY ld.location_state, dd.[year]
ORDER BY COUNT(tf.customer_key) DESC)
FROM TransactionFact AS tf
JOIN LocationDim AS ld
ON ld.location_key = tf.seller_location_key
JOIN DateDim AS dd
ON dd.date_key = tf.order_date_key
WHERE dd.[year] = 2016
GROUP BY ld.location_state, ld.location_city, dd.[year]
) q
WHERE rn <= 5
ORDER BY location_state, [year], rn
CodePudding user response:
You can add rownumbers using ROW_NUMBER() OVER (PARTITION BY ORDER BY )
The query below partition records by location_state
and add rownumber with number_of_customer
order :
Select * from
inn.*, ROW_NUMBER() OVER (PARTITION BY location_state ORDER BY Number_of_Customers DESC) AS rn
(
SELECT year,
location_state,
location_city,
COUNT(tf.customer_key) Number_of_Customers
FROM TransactionFact tf
JOIN LocationDim as ld
ON ld.location_key = tf.seller_location_key
JOIN DateDim dd
ON dd.date_key = tf.order_date_key
WHERE dd.year = 2016
GROUP BY dd.year, ld.location_state, ld.location_city
) inn
After this , you can easily filter this and select top5 or any other order ...
Note: I used your query as inner query. I didnt have a chance to test it since there is no fiddle
CodePudding user response:
you can try to pass the value of each location_state to your query using the cross apply
operator as follows:
;With STATES As (
SELECT location_state
FROM LocationDim
GROUP BY location_state)
SELECT T.[year], T.location_state, T.location_city, T.Number_of_Customers
FROM STATES CROSS APPLY (
SELECT TOP 5
[year], location_state, location_city,
COUNT(tf.customer_key) AS Number_of_Customers
FROM TransactionFact tf
JOIN LocationDim as ld
ON ld.location_key = tf.seller_location_key
JOIN DateDim dd
ON dd.date_key = tf.order_date_key
WHERE dd.[year] = 2016 and location_state = STATES.location_state
GROUP BY dd.[year], ld.location_state, ld.location_city
ORDER BY dd.[year] DESC, Number_of_Customers DESC) As T
CodePudding user response:
So, basically what's happening here:
SELECT TOP 5
dd.[year], ld.location_state, ld.location_city, COUNT(tf.customer_key) Number_of_Customers
FROM TransactionFact tf
JOIN LocationDim as ld
ON ld.location_key = tf.seller_location_key
JOIN DateDim dd
ON dd.date_key = tf.order_date_key
WHERE dd.year = 2016 and location_state = 'SP'
GROUP BY dd.year, ld.location_state, ld.location_city
ORDER BY dd.year DESC, Number_of_Customers DESC;
Is that you are selecting exclusively the top 5 results from that query.
But what you want is to get all results, ranked by year and state, and take only the top 5 from each state?
I'd use the RANK()
function which is designed pretty much specifically for the scenario you're looking at. I'll show it as an added column on your query:
SELECT * FROM ( SELECT dd.[year], ld.location_state, ld.location_city, COUNT(tf.customer_key) Number_of_Customers,
RANK() OVER(PARTITION BY dd.[year], ld.location_state, ld.location_city
ORDER BY Number_of_Customers DESC) r
FROM TransactionFact tf
JOIN LocationDim as ld
ON ld.location_key = tf.seller_location_key
JOIN DateDim dd
ON dd.date_key = tf.order_date_key
WHERE dd.year = 2016 and location_state = 'SP'
GROUP BY dd.year, ld.location_state, ld.location_city
) x WHERE x.r <= 5
ORDER BY x.[year] desc, x.location_state, x.r
Alternatively, you could use a CTE
(Common Table Expression) to hold the results from your first query, before applying the RANK
:
;WITH cte AS(
SELECT dd.[year], ld.location_state, ld.location_city, COUNT(tf.customer_key) Number_of_Customers,
RANK() OVER(PARTITION BY [year], location_state, location_city
ORDER BY Number_Of_Customers DESC) r
FROM TransactionFact tf
JOIN LocationDim as ld
ON ld.location_key = tf.seller_location_key
JOIN DateDim dd
ON dd.date_key = tf.order_date_key
WHERE dd.year = 2016 and location_state = 'SP'
GROUP BY dd.year, ld.location_state, ld.location_city
)
SELECT *
FROM cte
WHERE r <= 5;
As a disclaimer, I'm only placing a semicolon before the WITH
to indicate that if there is a statement beforehand that doesn't end with a semicolon, then this statement will throw an error.
EDIT: To add, using RANK
means that the results get ranked by value. So if two cities have 30,000 customers, they will both get the same value for the RANK (similar to what they do in leaderboards when people are tied in a round of golf). Meaning you would get a minimum of 5 results from each state - if you want exactly 5, regardless of tied values, then you can use ROW_NUMBER
in the same way.