I'm trying to understand how to avoid PSQL duplicate rows when I'm applying a windows function and adding a groupBy
.
I created a DBfiddle very simple but was unable to reproduce the issue there, however, is a starting point
Resulting in data removing the windows function and total from groupby
CodePudding user response:
total
in the GROUP BY
clause causes the extra groups to be generated. To remove that we also need to refer to the SUM(total)
aggregate in the select list and also adjust LAG
to refer to the prior adjacent row. The - 0
operation is also not necessary.
Here's the updated fiddle with data which reproduces the extra row issue, plus the corrected SQL.
SELECT
"candidates"."day" AS "day",
"candidates"."study_id" AS "studyId",
"candidates"."status" AS "status",
sum("current") AS "current",
sum("total") AS "total",
COALESCE(SUM(total) - LAG(SUM(total)) OVER (ORDER BY day), 0) AS difference
FROM
"candidates"
WHERE
"study_id" in('CY')
AND "status" in('PENDING_CALLCENTER')
GROUP BY
"day",
"study_id",
"status"
ORDER BY
"day" ASC,
"study_id" ASC,
"status" ASC
;
Result prior to the correction:
day | studyId | status | current | total | difference |
---|---|---|---|---|---|
2021-04-13 01:00:00 01 | CY | PENDING_CALLCENTER | 0 | 0 | 0 |
2021-04-13 01:00:00 01 | CY | PENDING_CALLCENTER | 0 | 1 | 0 |
2021-04-14 01:00:00 01 | CY | PENDING_CALLCENTER | 0 | 0 | 0 |
2021-04-14 01:00:00 01 | CY | PENDING_CALLCENTER | 0 | 2 | 1 |
2021-04-15 01:00:00 01 | CY | PENDING_CALLCENTER | 0 | 3 | 3 |
2021-04-16 01:00:00 01 | CY | PENDING_CALLCENTER | 10 | 14 | 12 |
2021-04-17 01:00:00 01 | CY | PENDING_CALLCENTER | 0 | 15 | 12 |
Result after the correction:
day | studyId | status | current | total | difference |
---|---|---|---|---|---|
2021-04-13 01:00:00 01 | CY | PENDING_CALLCENTER | 0 | 1 | 0 |
2021-04-14 01:00:00 01 | CY | PENDING_CALLCENTER | 0 | 2 | 1 |
2021-04-15 01:00:00 01 | CY | PENDING_CALLCENTER | 0 | 3 | 1 |
2021-04-16 01:00:00 01 | CY | PENDING_CALLCENTER | 10 | 14 | 11 |
2021-04-17 01:00:00 01 | CY | PENDING_CALLCENTER | 0 | 15 | 1 |
The setup:
CREATE TABLE "public"."candidates" (
"day" timestamptz NOT NULL DEFAULT CURRENT_TIMESTAMP,
"study_id" varchar(6) NOT NULL,
"site_id" varchar(32),
"status" varchar(32) NOT NULL,
"total" int4 NOT NULL,
"current" int4 NOT NULL
);
INSERT INTO "public"."candidates" ("day", "study_id", "site_id", "status", "total", "current") VALUES
('2021-04-13 00:00:00 00', 'CY', 'BEL-1', 'PENDING_CALLCENTER', 0, 0),
('2021-04-13 00:00:00 00', 'CY', 'BEL-1', 'PENDING_CALLCENTER', 1, 0),
('2021-04-14 00:00:00 00', 'CY', 'ESP-1', 'PENDING_CALLCENTER', 0, 0),
('2021-04-14 00:00:00 00', 'CY', 'ESP-1', 'PENDING_CALLCENTER', 2, 0),
('2021-04-15 00:00:00 00', 'CY', 'SWE-1', 'PENDING_CALLCENTER', 3, 0),
('2021-04-16 00:00:00 00', 'CY', 'USA-1', 'PENDING_CALLCENTER', 14, 10),
('2021-04-17 00:00:00 00', 'CY', NULL, 'PENDING_CALLCENTER', 15, 0),
('2021-04-18 00:00:00 00', 'CY', 'BEL-1', 'REJECTED_CALLCENTER', 25, 0),
('2021-04-19 00:00:00 00', 'CY', 'ESP-1', 'REJECTED_CALLCENTER', 50, 0),
('2021-04-20 00:00:00 00', 'CY', 'SWE-1', 'REJECTED_CALLCENTER', 67, 0),
('2021-04-21 00:00:00 00', 'CY', 'USA-1', 'REJECTED_CALLCENTER', 90, 0)
;