I am struggling with window functions in SQL Server. I have a table that is tracking how many records were transferred. I wrote a query to sum how many rows are transferred for each parameter. However, at some point records had to be resend (they were dropped from the final location and resend). So, if I continue with my old query, I get duplicated values.
This is an example table:
parameter | rows | min_id | max_id | create_date | status |
---|---|---|---|---|---|
A1 | 48 | 350 | 521 | 06.11.2022 | sent |
A1 | 48 | 350 | 521 | 06.11.2022 | error |
A1 | 78 | 1 | 350 | 05.11.2022 | sent |
A1 | 13 | 299 | 350 | 04.11.2022 | sent |
A1 | 50 | 100 | 299 | 03.11.2022 | sent |
A1 | 15 | 1 | 100 | 01.11.2022 | sent |
B2 | 87 | 800 | 1202 | 07.11.2022 | sent |
B2 | 187 | 1 | 800 | 06.11.2022 | sent |
B2 | 12 | 570 | 800 | 04.11.2022 | sent |
B2 | 120 | 320 | 570 | 03.11.2022 | sent |
B2 | 55 | 1 | 320 | 01.11.2022 | sent |
You can understand when the table was resend when min_id is 1 again.
The result I want to achieve is:
parameter | sum | min_id | max_id | max_date |
---|---|---|---|---|
A1 | 126 | 1 | 521 | 06.11.2022 |
B2 | 274 | 1 | 1202 | 07.11.2022 |
What I was able to do so far (but is causing duplicate results):
SELECT * FROM
(SELECT
parameter
, sum(rows) over (partition by parameter) as sum
, min_id
, max_id
, MAX(create_date) over (partition by parameter) as max_date
FROM my_table) as s
WHERE create_date = max_date and status = 'sent'
I think that maybe one more window function (nested window function?) needs to be added that will make a certain range of partitions starting with min_id=1 having the latest create_date. However, I failed to do so. Could anyone advise on how to approach this?
CodePudding user response:
Instead of using partitions you can try using the GROUP BY clause since the output which you are expecting might have aggregations in all columns. You can try the below query:
SELECT
parameter,
SUM(rows) AS rows,
MIN(min_id) AS min_id,
MAX(max_id) AS max_id,
MAX(create_date) AS max_date
FROM my_table
WHERE status = 'sent'
GROUP BY parameter
This will give you the expected output. But still if you'd like to go with your old query and the duplicates are the only issue, you can try using DISTINCT keyword after SELECT to give you the unique records.
CodePudding user response:
With a small adjustment you could fetch the results as below:
SELECT parameter, sum(rows) as sum, min(min_id) as min_id, max(max_id) as max_id,
max(create_date) as max_date
FROM
(SELECT
parameter
, rows
, min_id
, max_id
, create_date
, status
, MAX(case when min_id = 1 then create_date end) over (partition by parameter) as sent_start
FROM my_table) as s
WHERE create_date >= sent_start and status = 'sent'
GROUP BY parameter
It's worth considering the variations in data. Could records be resent with a min_id greater than 1? Can records be sent and resent within the same day?
If any of these are a possibility you may want to test using an EXISTS condition:
OPTION 2
;WITH SentRows as
(
SELECT *
FROM my_table
WHERE status='sent'
)
SELECT parameter, sum(rows) as sum, min(min_id) as min_id, max(max_id) as max_id,
max(create_date) as max_date
FROM SentRows as s
WHERE NOT EXISTS
(SELECT 1 FROM SentRows t WHERE t.parameter = s.parameter AND t.create_date > s.create_date
AND t.min_id <= s.min_id AND t.max_id >= s.max_id)
GROUP BY parameter
For partially overlapping records you may want to involve window functions, but here it isn't required.