Count of boolean column compared to first and last values-CodePudding

I've got a table like this:

product_history

fk_product_id	product_name (varchar)	product_name_changed (bit)	product_cost (int)	product_cost_changed (bit)	timestamp (datetime)
284	Product X	false	100	false	08/04/2022 10am
285	Product Y	false	20	false	08/06/2022 1pm
285	Product Y-2	true	20	false	08/07/2022 7am
285	Product Y	true	20	false	08/07/2022 8am
284	Product X-2	true	100	false	08/07/2022 1pm
284	Product X-3	true	500	true	08/07/2022 3pm
286	Product Z	false	4	false	08/08/2022 11am
286	Product Z	false	8	true	08/08/2022 4pm
286	Product Z-2	true	8	true	08/09/2022 2pm

This is basically a copy of a row in a product table, with a boolean column to show what columns were changed in this edit history. Now the customer wants to query this data for all values between a date range. Here's an example of how they want the results to look for something like a query between 08/07/2022 6am to 08/08/2022 10pm

fk_product_id	product_name_count	product_cost_count
284	2	1
285	0	0
286	0	1

Essentially, they want to count all changes (true values in the changed columns) that occur per ID in the given time period. This is fairly easy, my approach on that is to do something like this:

count(CASE WHEN product_name_changed THEN 1 END) AS product_name_count,

in a select, where it's within the range and with GROUP BY fk_product_id.

The Question

The tricky part here is the edge case they want. If, the last entry in the date range for that product ID has the same value as the first entry in the entire table for that product ID, then the [column]_count should be 0. Basically, if the change has been reverted to be equal to the first value, then that should show that no changes have been done in the time period on that column.

In my example, Product Y has it's name changed and reverted. As such, in the output, it's row has a 0 in product_name_count despite there being multiple "true" values in the product_name_changed column for that id. This occurs even though the first Product Y entry is outside the date range.

What I've tried

I've been searching a lot for this. I tried adding something like

last_value(product_name) over (
    PARTITION BY fk_product_id]
    ORDER BY timestamp) AS product_name,

in order to store the last value and then use that in the logic for the count, but I'm not sure how to do that and get the first_value (including outside of the date range) to do the comparison.

And at this point I might try to make this two procedures and do the logic outside of SQL. However, since it was requested I try to do it in one go, I wanted to ask here and see if anybody could help at all.

CodePudding user response：

One way would be as follows

DECLARE @StartTime datetime = '2022-08-07 06:00:00.000', 
        @EndTime datetime = '2022-08-08 22:00:00.000';


WITH T AS
(
SELECT fk_product_id,       
       first_value(product_name) over (PARTITION BY fk_product_id ORDER BY timestamp) AS first_product_name, 
       first_value(product_cost) over (PARTITION BY fk_product_id ORDER BY timestamp) AS first_product_cost, 
       first_value(product_name) over (PARTITION BY fk_product_id ORDER BY timestamp desc) AS last_product_name,
       first_value(product_cost) over (PARTITION BY fk_product_id ORDER BY timestamp desc) AS last_product_cost,
       CAST(product_name_changed AS INT) AS product_name_changed,
       CAST(product_cost_changed AS INT) AS product_cost_changed,
       CASE WHEN timestamp >= @StartTime THEN 1 ELSE 0 END AS in_time_period
FROM product_history
WHERE timestamp <= @EndTime 
)
SELECT fk_product_id,
       SUM(CASE WHEN in_time_period = 1 AND first_product_name <> last_product_name THEN product_name_changed ELSE 0 END),
       SUM(CASE WHEN in_time_period = 1 AND first_product_cost <> last_product_cost THEN product_cost_changed ELSE 0 END)
FROM T
GROUP BY fk_product_id
HAVING SUM(in_time_period) > 0

The where clause is on timestamp <= @EndTime so the earliest values are available to the window functions but then there is additional logic added to only count the "changed" events if they fall inside the start/end window of interest and if the overall first value <> the overall last value.

Fiddle

NB1: I just use first_value and descending order rather than last_value. If you use last_value you need to also include UNBOUNDED FOLLOWING for it to work as you were wanting.

NB2: I also order the columns using the same window frames next to each other. This at least used to prevent some gratuitous sort operations compared to having frames intermingled. I haven't checked recently if this is still the case.