I have a dataset which looks like this:
| ReportId | Method | Status | OrganizationId | StartedAt |
|-----------|--------|--------|----------------|-------------------------------|
| 38373bfk8 | Email | 0 | ABC | 2022-06-10 00:00:53.794 0000 |
| 78687fea | Email | 0 | XYZ | 2022-06-10 00:03:51.432 0000 |
| 48978kd | Email | 100 | POD | 2022-06-10 00:02:45.532 0000 |
| 38373bfk8 | Email | 100 | ABC | 2022-06-10 00:00:22.654 0000 |
| 86887dhd | Csv | 100 | FGH | 2022-06-10 00:03:12.541 0000 |
| 78687fea | Email | 100 | XYZ | 2022-06-11 00:04:51.352 0000 |
In the Status column, 0
indicates a failure and 100
indicates a success for the generation of a Report for an Org.
I want to implement logic that if a certain ReportId/Method/OrganizationId combination fails but the same combination has a successful status later in the same day, then exclude that initial failure. I essentially want to keep all successful rows and all failure rows that didn't have a success later in the same day.
From the above dataset, we would remove the first row as there is a success for that ReportId/Method/Status combination later in the day (row 4). We would keep row two despite it also being a failure, because there is no successful status later in the same day (row 6 is the next day). So all rows would be kept except the first.
I have built a CTE to rank the rows based accordingly:
with Ranked as (
select
ReportId,
Method,
Status,
OrganizationId,
StartedAt,
row_number() over (partition by ReportId, Method, OrganizationId, cast(StartedAt as date) order by StartedAt asc) as rn
from
MyTable
)
Then I filter out the rows I don't want based on the above logic:
ExcludeFirstFailures as (
select
ReportId,
Method,
Status,
OrganizationId,
StartedAt,
rn
from
Ranked
where
(Status in 0 and rn > 1) --Keep failures that weren't the first of the day
or Status = 100 --Keep all successful rows
)
This is close to what I need but the problem is that it is filtering out all failures which were the first of the day, without making a comparison if the same report ran ran successfully later on the same day - only those failures should be excluded.
CodePudding user response:
What about this one?
with report_rn as (
select
ReportId,
Method,
Status,
OrganizationId,
StartedAt,
row_number() over (
partition by ReportId, Method, OrganizationId, cast(StartedAt as date)
order by status desc, StartedAt desc
) as rn
from
report
qualify rn = 1
)
select
ReportId,
Method,
Status,
OrganizationId,
StartedAt
from report_rn
order by StartedAt;
----------- -------- -------- ---------------- -------------------------------
| REPORTID | METHOD | STATUS | ORGANIZATIONID | STARTEDAT |
|----------- -------- -------- ---------------- -------------------------------|
| 38373bfk8 | Email | 100 | ABC | 2022-06-10 00:00:22.654000000 |
| 48978kd | Email | 100 | POD | 2022-06-10 00:02:45.532000000 |
| 86887dhd | Csv | 100 | FGH | 2022-06-10 00:03:12.541000000 |
| 78687fea | Email | 0 | XYZ | 2022-06-10 00:03:51.432000000 |
| 78687fea | Email | 100 | XYZ | 2022-06-11 00:04:51.352000000 |
----------- -------- -------- ---------------- -------------------------------
If it ran only once, either successful or failed will have RN = 1. If there is a successful one later on on the same day, because we order by status desc, status 100 will be on the top, so it will be selected.
If there are multiple failures on the same day, then the latest failed one will be returned because we again order by StartedAt desc as well.
CodePudding user response:
Can you try this one?
with successful as (
select
ReportId,
Method,
Status,
OrganizationId,
StartedAt,
MAX(Status) over (partition by ReportId, Method, OrganizationId, cast(StartedAt as date)) as success
from
MyTable
)
select *
from successful
where
success = 0 -- all failure rows that didn't have a success later
or Status = 100 --Keep all successful rows
order by STARTEDAT, REPORTID, status
;
----------- -------- -------- ---------------- ------------------------------- ---------
| REPORTID | METHOD | STATUS | ORGANIZATIONID | STARTEDAT | SUCCESS |
----------- -------- -------- ---------------- ------------------------------- ---------
| 38373bfk8 | Email | 100 | ABC | 2022-06-10 00:00:22.654 0000 | 100 |
| 48978kd | Email | 100 | POD | 2022-06-10 00:02:45.532 0000 | 100 |
| 86887dhd | Csv | 100 | FGH | 2022-06-10 00:03:12.541 0000 | 100 |
| 78687fea | Email | 0 | XYZ | 2022-06-10 00:03:51.432 0000 | 0 |
| 78687fea | Email | 100 | XYZ | 2022-06-11 00:04:51.352 0000 | 100 |
----------- -------- -------- ---------------- ------------------------------- ---------
78687fea wasn't successful on 2022-06-10.
Extra info:
I'm informed that status code 15 is also counted as successful. If this is the complete list of successful codes, we can still use the MAX approach: Instead of using plain "status" column with MAX, we can use "IFF(Status = 15,'100', Status )" expression, and filter for "Status IN (15,100)".
To be able to test, I set the status to 15 for reports (id: 38373bfk8 and 48978kd).
The output of the modified SQL:
----------- -------- -------- ---------------- ------------------------------- ---------
| REPORTID | METHOD | STATUS | ORGANIZATIONID | STARTEDAT | SUCCESS |
----------- -------- -------- ---------------- ------------------------------- ---------
| 38373bfk8 | Email | 15 | ABC | 2022-06-10 00:00:22.654 0000 | 100 |
| 48978kd | Email | 15 | POD | 2022-06-10 00:02:45.532 0000 | 100 |
| 86887dhd | Csv | 100 | FGH | 2022-06-10 00:03:12.541 0000 | 100 |
| 78687fea | Email | 0 | XYZ | 2022-06-10 00:03:51.432 0000 | 0 |
| 78687fea | Email | 100 | XYZ | 2022-06-11 00:04:51.352 0000 | 100 |
----------- -------- -------- ---------------- ------------------------------- ---------