I've got a table like this:
product_history
fk_product_id | product_name (varchar) | product_name_changed (bit) | product_cost (int) | product_cost_changed (bit) | timestamp (datetime) |
---|---|---|---|---|---|
284 | Product X | false | 100 | false | 08/04/2022 10am |
285 | Product Y | false | 20 | false | 08/06/2022 1pm |
285 | Product Y-2 | true | 20 | false | 08/07/2022 7am |
285 | Product Y | true | 20 | false | 08/07/2022 8am |
284 | Product X-2 | true | 100 | false | 08/07/2022 1pm |
284 | Product X-3 | true | 500 | true | 08/07/2022 3pm |
286 | Product Z | false | 4 | false | 08/08/2022 11am |
286 | Product Z | false | 8 | true | 08/08/2022 4pm |
286 | Product Z-2 | true | 8 | true | 08/09/2022 2pm |
This is basically a copy of a row in a product table, with a boolean column to show what columns were changed in this edit history. Now the customer wants to query this data for all values between a date range. Here's an example of how they want the results to look for something like a query between 08/07/2022 6am to 08/08/2022 10pm
fk_product_id | product_name_count | product_cost_count |
---|---|---|
284 | 2 | 1 |
285 | 0 | 0 |
286 | 0 | 1 |
Essentially, they want to count all changes (true values in the changed columns) that occur per ID in the given time period. This is fairly easy, my approach on that is to do something like this:
count(CASE WHEN product_name_changed THEN 1 END) AS product_name_count,
in a select, where it's within the range and with GROUP BY fk_product_id.
The Question
The tricky part here is the edge case they want. If, the last entry in the date range for that product ID has the same value as the first entry in the entire table for that product ID, then the [column]_count should be 0. Basically, if the change has been reverted to be equal to the first value, then that should show that no changes have been done in the time period on that column.
In my example, Product Y has it's name changed and reverted. As such, in the output, it's row has a 0 in product_name_count despite there being multiple "true" values in the product_name_changed column for that id. This occurs even though the first Product Y entry is outside the date range.
What I've tried
I've been searching a lot for this. I tried adding something like
last_value(product_name) over (
PARTITION BY fk_product_id]
ORDER BY timestamp) AS product_name,
in order to store the last value and then use that in the logic for the count, but I'm not sure how to do that and get the first_value (including outside of the date range) to do the comparison.
And at this point I might try to make this two procedures and do the logic outside of SQL. However, since it was requested I try to do it in one go, I wanted to ask here and see if anybody could help at all.
CodePudding user response:
One way would be as follows
DECLARE @StartTime datetime = '2022-08-07 06:00:00.000',
@EndTime datetime = '2022-08-08 22:00:00.000';
WITH T AS
(
SELECT fk_product_id,
first_value(product_name) over (PARTITION BY fk_product_id ORDER BY timestamp) AS first_product_name,
first_value(product_cost) over (PARTITION BY fk_product_id ORDER BY timestamp) AS first_product_cost,
first_value(product_name) over (PARTITION BY fk_product_id ORDER BY timestamp desc) AS last_product_name,
first_value(product_cost) over (PARTITION BY fk_product_id ORDER BY timestamp desc) AS last_product_cost,
CAST(product_name_changed AS INT) AS product_name_changed,
CAST(product_cost_changed AS INT) AS product_cost_changed,
CASE WHEN timestamp >= @StartTime THEN 1 ELSE 0 END AS in_time_period
FROM product_history
WHERE timestamp <= @EndTime
)
SELECT fk_product_id,
SUM(CASE WHEN in_time_period = 1 AND first_product_name <> last_product_name THEN product_name_changed ELSE 0 END),
SUM(CASE WHEN in_time_period = 1 AND first_product_cost <> last_product_cost THEN product_cost_changed ELSE 0 END)
FROM T
GROUP BY fk_product_id
HAVING SUM(in_time_period) > 0
The where clause is on timestamp <= @EndTime
so the earliest values are available to the window functions but then there is additional logic added to only count the "changed" events if they fall inside the start/end window of interest and if the overall first value <> the overall last value.
NB1: I just use first_value
and descending order rather than last_value
. If you use last_value
you need to also include UNBOUNDED FOLLOWING
for it to work as you were wanting.
NB2: I also order the columns using the same window frames next to each other. This at least used to prevent some gratuitous sort operations compared to having frames intermingled. I haven't checked recently if this is still the case.