SQL Server - Is there a way to loop over a column in a table and use it in the WHERE clause of another

I am trying to fetch hierarchical data from my database. Here is my initial code.

        SELECT TOP 5
            year, location_state, location_city,
            COUNT(tf.customer_key) Number_of_Customers
        FROM TransactionFact tf
        JOIN LocationDim as ld
            ON ld.location_key = tf.seller_location_key
        JOIN DateDim dd
            ON dd.date_key = tf.order_date_key
        WHERE dd.year = 2016 and location_state = 'SP'
        GROUP BY dd.year, ld.location_state, ld.location_city
        ORDER BY dd.year DESC, Number_of_Customers DESC

And here is the result.

Basically, I don't want to hard-code location_state in the WHERE clause. I want it to be dynamic, so that I get the top 5 cities in each state.

Here are the column names for the LocationDim table

location_key
location_zip_code_prefix
location_state
location_city

EDITED: What I need is something like this.

+------+----------------+---------------+---------------------+
| year | location_state | location_city | Number_of_Customers |
+------+----------------+---------------+---------------------+
| 2016 |    STATE_1     |    city_1     |       100           |
| 2016 |    STATE_1     |    city_2     |       90            |
| 2016 |    STATE_1     |    city_3     |       89            |
| 2016 |    STATE_1     |    city_4     |       88            |
| 2016 |    STATE_1     |    city_5     |       20            |
| 2016 |    STATE_2     |    city_1     |       100           |
| 2016 |    STATE_2     |    city_2     |       45            |
| 2016 |    STATE_2     |    city_3     |       23            |
| 2016 |    STATE_2     |    city_4     |       10            |
| 2016 |    STATE_2     |    city_5     |       5             |
+------+----------------+---------------+---------------------+

PS: Sorry, this is my first question on Stack Overflow. If it is a duplicate, please drop the link and I'll give it a go. Thank you in advance.

CodePudding user response：

If you calculate a ROW_NUMBER within each state, you can then filter on it.

SELECT *
FROM 
(
    SELECT 
      dd.[year]
    , ld.location_state
    , ld.location_city
    , COUNT(tf.customer_key) AS total_customers
    , rn = ROW_NUMBER() OVER (PARTITION BY ld.location_state, dd.[year] 
                              ORDER BY COUNT(tf.customer_key) DESC)
    FROM TransactionFact AS tf
    JOIN LocationDim AS ld
      ON ld.location_key = tf.seller_location_key
    JOIN DateDim AS dd
      ON dd.date_key = tf.order_date_key
    WHERE dd.[year] = 2016
    GROUP BY ld.location_state, ld.location_city, dd.[year] 
) q
WHERE rn <= 5
ORDER BY location_state, [year], rn
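
To see the pattern run end to end, here is a minimal sketch on made-up data. It uses Python's built-in sqlite3 module as a stand-in for SQL Server (an assumption for illustration only), leaves out DateDim, keeps the top 2 cities instead of the top 5, and does the COUNT in a separate CTE before numbering the rows. The table contents and the `counts` mapping are hypothetical.

```python
import sqlite3

# In-memory SQLite database standing in for SQL Server (illustration only).
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE LocationDim (location_key INTEGER, location_state TEXT, location_city TEXT);
    CREATE TABLE TransactionFact (customer_key INTEGER, seller_location_key INTEGER);
    INSERT INTO LocationDim VALUES
        (1, 'SP', 'sao paulo'), (2, 'SP', 'campinas'), (3, 'SP', 'santos'),
        (4, 'RJ', 'rio de janeiro'), (5, 'RJ', 'niteroi'), (6, 'RJ', 'petropolis');
""")

# Made-up number of transactions per location_key.
counts = {1: 3, 2: 2, 3: 1, 4: 3, 5: 1, 6: 2}
rows = [(i, key) for key, n in counts.items() for i in range(n)]
conn.executemany("INSERT INTO TransactionFact VALUES (?, ?)", rows)

top2 = conn.execute("""
    WITH city_totals AS (
        -- One row per city with its customer count.
        SELECT ld.location_state, ld.location_city,
               COUNT(tf.customer_key) AS total_customers
        FROM TransactionFact AS tf
        JOIN LocationDim AS ld ON ld.location_key = tf.seller_location_key
        GROUP BY ld.location_state, ld.location_city
    ),
    ranked AS (
        -- Number the cities inside each state, busiest first.
        SELECT *, ROW_NUMBER() OVER (PARTITION BY location_state
                                     ORDER BY total_customers DESC) AS rn
        FROM city_totals
    )
    SELECT location_state, location_city, total_customers, rn
    FROM ranked
    WHERE rn <= 2
    ORDER BY location_state, rn
""").fetchall()

for row in top2:
    print(row)
```

Each state contributes its own two busiest cities, which is the per-group behaviour that a single TOP cannot give you.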

CodePudding user response：

You can add row numbers using ROW_NUMBER() OVER (PARTITION BY ... ORDER BY ...).

The query below partitions the records by location_state and numbers the rows in each partition by Number_of_Customers, in descending order:

SELECT inn.*,
       ROW_NUMBER() OVER (PARTITION BY inn.location_state
                          ORDER BY inn.Number_of_Customers DESC) AS rn
FROM
(
SELECT dd.[year],
       ld.location_state,
       ld.location_city,
       COUNT(tf.customer_key) AS Number_of_Customers
  FROM TransactionFact tf
  JOIN LocationDim AS ld
    ON ld.location_key = tf.seller_location_key
  JOIN DateDim dd
    ON dd.date_key = tf.order_date_key
 WHERE dd.[year] = 2016
 GROUP BY dd.[year], ld.location_state, ld.location_city
) inn

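After this, you can wrap the whole thing in one more outer query and filter on rn to select the top 5 (or any other cutoff) per state. The filter has to go in an outer query because a window function cannot be referenced in the WHERE clause of the same SELECT.

Note: I used your query as the inner query. I didn't have a chance to test it, since there is no fiddle.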

CodePudding user response：

You can try passing each location_state value to your query using the CROSS APPLY operator, as follows:

;With STATES As (
SELECT location_state 
FROM LocationDim 
GROUP BY location_state)
SELECT T.[year], T.location_state, T.location_city, T.Number_of_Customers
FROM STATES CROSS APPLY (
            SELECT TOP 5
                [year], location_state, location_city,
                COUNT(tf.customer_key) AS Number_of_Customers
            FROM TransactionFact tf
            JOIN LocationDim as ld
                ON ld.location_key = tf.seller_location_key
            JOIN DateDim dd
                ON dd.date_key = tf.order_date_key
            WHERE dd.[year] = 2016 and location_state = STATES.location_state
            GROUP BY dd.[year], ld.location_state, ld.location_city
            ORDER BY dd.[year] DESC, Number_of_Customers DESC) As T
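
Conceptually, CROSS APPLY evaluates the inner query once for every row of STATES. As a rough illustration of that "loop over the states" idea, here is a sketch that spells the loop out in Python with the built-in sqlite3 module. SQLite has no CROSS APPLY or TOP, so this uses an explicit loop and LIMIT instead. The `city_counts` table, its pre-aggregated numbers, and the cutoff of 2 are all made up for the example.

```python
import sqlite3

# In-memory SQLite database with hypothetical, already-aggregated counts.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE city_counts (location_state TEXT, location_city TEXT, customers INTEGER);
    INSERT INTO city_counts VALUES
        ('SP', 'sao paulo', 100), ('SP', 'campinas', 90), ('SP', 'santos', 80),
        ('RJ', 'rio de janeiro', 70), ('RJ', 'niteroi', 60), ('RJ', 'petropolis', 50);
""")

# Outer side of the APPLY: one row per distinct state.
states = [s for (s,) in conn.execute(
    "SELECT DISTINCT location_state FROM city_counts ORDER BY location_state")]

# Inner side: run the "top N" query once per state and concatenate the results.
top_per_state = []
for state in states:
    top_per_state += conn.execute("""
        SELECT location_state, location_city, customers
        FROM city_counts
        WHERE location_state = ?
        ORDER BY customers DESC
        LIMIT 2
    """, (state,)).fetchall()

for row in top_per_state:
    print(row)
```

In SQL Server the CROSS APPLY version does this in a single statement, so there is no need for a client-side loop there.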

CodePudding user response：

So, basically what's happening here:

    SELECT TOP 5
        dd.[year], ld.location_state, ld.location_city, COUNT(tf.customer_key) Number_of_Customers
    FROM TransactionFact tf
    JOIN LocationDim as ld
        ON ld.location_key = tf.seller_location_key
    JOIN DateDim dd
        ON dd.date_key = tf.order_date_key
    WHERE dd.year = 2016 and location_state = 'SP'
    GROUP BY dd.year, ld.location_state, ld.location_city
    ORDER BY dd.year DESC, Number_of_Customers DESC;

is that TOP 5 applies to the whole result set, so you only ever get five rows in total.

What you want instead is to rank all results within each year and state, and keep only the top 5 from each state.

I'd use the RANK() function which is designed pretty much specifically for the scenario you're looking at. I'll show it as an added column on your query:

  SELECT * FROM (
    SELECT dd.[year], ld.location_state, ld.location_city,
           COUNT(tf.customer_key) Number_of_Customers,
           -- Partition by year and state only: adding location_city would put
           -- every city in its own partition and give every row rank 1.
           RANK() OVER(PARTITION BY dd.[year], ld.location_state
                       ORDER BY COUNT(tf.customer_key) DESC) r
    FROM TransactionFact tf
    JOIN LocationDim as ld
        ON ld.location_key = tf.seller_location_key
    JOIN DateDim dd
        ON dd.date_key = tf.order_date_key
    WHERE dd.[year] = 2016   -- no location_state filter, so every state is ranked
    GROUP BY dd.[year], ld.location_state, ld.location_city
     ) x WHERE x.r <= 5
     ORDER BY x.[year] desc, x.location_state, x.r

Alternatively, you could use a CTE (Common Table Expression) to hold the results from your first query, before applying the RANK:

;WITH cte AS(
   SELECT dd.[year], ld.location_state, ld.location_city, COUNT(tf.customer_key) Number_of_Customers, 
       -- Rank cities within each year and state; the column alias cannot be
       -- used inside OVER, so the COUNT is repeated here.
       RANK() OVER(PARTITION BY dd.[year], ld.location_state
                   ORDER BY COUNT(tf.customer_key) DESC) r
   FROM TransactionFact tf
        JOIN LocationDim as ld
            ON ld.location_key = tf.seller_location_key
        JOIN DateDim dd
            ON dd.date_key = tf.order_date_key
   WHERE dd.[year] = 2016
   GROUP BY dd.[year], ld.location_state, ld.location_city
)
SELECT *
FROM cte
WHERE r <= 5;

As a disclaimer, the semicolon before the WITH is only there as a safeguard: if the preceding statement doesn't end with a semicolon, the WITH statement will throw an error.

EDIT: To add, using RANK means that the results get ranked by value, so ties share a rank. If two cities both have 30,000 customers, they get the same value for the RANK (similar to a golf leaderboard when players are tied). This means a state can return more than 5 results when there are ties at or above the cutoff. If you want no more than 5 per state, regardless of tied values, you can use ROW_NUMBER in the same way.
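
The difference is easy to see on a tiny made-up example. The sketch below again uses Python's built-in sqlite3 module as a stand-in for SQL Server. The `city_counts` table and its values are hypothetical, and the cutoff is 2 rather than 5 to keep the data small. The extra `location_city` sort key in ROW_NUMBER is only there to make the tie-break deterministic.

```python
import sqlite3

# Three cities tied on 30 customers, one city on 10 (made-up data).
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE city_counts (location_city TEXT, customers INTEGER);
    INSERT INTO city_counts VALUES ('a', 30), ('b', 30), ('c', 30), ('d', 10);
""")

ranked = conn.execute("""
    SELECT location_city,
           RANK()       OVER (ORDER BY customers DESC)                AS r,
           ROW_NUMBER() OVER (ORDER BY customers DESC, location_city) AS rn
    FROM city_counts
    ORDER BY customers DESC, location_city
""").fetchall()

# Apply the same "top 2" cutoff to each numbering.
by_rank = [city for city, r, rn in ranked if r <= 2]
by_row_number = [city for city, r, rn in ranked if rn <= 2]

print(ranked)         # tied cities share rank 1; the next rank is 4
print(by_rank)        # all three tied cities pass the cutoff
print(by_row_number)  # only two cities pass the cutoff
```

With RANK the three tied cities all come back, so the cutoff yields three rows. With ROW_NUMBER only two rows come back, and which of the tied cities are kept depends on the tie-break in the ORDER BY.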