PostgreSQL how to avoid duplicate rows with same date when adding a windows function and groupby-CodePudding

I'm trying to understand how to avoid PSQL duplicate rows when I'm applying a windows function and adding a groupBy.

I created a DBfiddle very simple but was unable to reproduce the issue there, however, is a starting point

Resulting in data removing the windows function and total from groupby

CodePudding user response：

total in the GROUP BY clause causes the extra groups to be generated. To remove that we also need to refer to the SUM(total) aggregate in the select list and also adjust LAG to refer to the prior adjacent row. The - 0 operation is also not necessary.

Here's the updated fiddle with data which reproduces the extra row issue, plus the corrected SQL.

SELECT
    "candidates"."day" AS "day",
    "candidates"."study_id" AS "studyId",
    "candidates"."status" AS "status",
    sum("current") AS "current",
    sum("total") AS "total",
    COALESCE(SUM(total) - LAG(SUM(total)) OVER (ORDER BY day), 0) AS difference
FROM
    "candidates"
WHERE
    "study_id" in('CY')
    AND "status" in('PENDING_CALLCENTER')
GROUP BY
    "day",
    "study_id",
    "status"
ORDER BY
    "day" ASC,
    "study_id" ASC,
    "status" ASC
;

Result prior to the correction:

day	studyId	status	current	total	difference
2021-04-13 01:00:00 01	CY	PENDING_CALLCENTER	0	0	0
2021-04-13 01:00:00 01	CY	PENDING_CALLCENTER	0	1	0
2021-04-14 01:00:00 01	CY	PENDING_CALLCENTER	0	0	0
2021-04-14 01:00:00 01	CY	PENDING_CALLCENTER	0	2	1
2021-04-15 01:00:00 01	CY	PENDING_CALLCENTER	0	3	3
2021-04-16 01:00:00 01	CY	PENDING_CALLCENTER	10	14	12
2021-04-17 01:00:00 01	CY	PENDING_CALLCENTER	0	15	12

Result after the correction:

day	studyId	status	current	total	difference
2021-04-13 01:00:00 01	CY	PENDING_CALLCENTER	0	1	0
2021-04-14 01:00:00 01	CY	PENDING_CALLCENTER	0	2	1
2021-04-15 01:00:00 01	CY	PENDING_CALLCENTER	0	3	1
2021-04-16 01:00:00 01	CY	PENDING_CALLCENTER	10	14	11
2021-04-17 01:00:00 01	CY	PENDING_CALLCENTER	0	15	1

The setup:

CREATE TABLE "public"."candidates" (
    "day" timestamptz NOT NULL DEFAULT CURRENT_TIMESTAMP,
    "study_id" varchar(6) NOT NULL,
    "site_id" varchar(32),
    "status" varchar(32) NOT NULL,
    "total" int4 NOT NULL,
    "current" int4 NOT NULL
);

INSERT INTO "public"."candidates" ("day", "study_id", "site_id", "status", "total", "current") VALUES
('2021-04-13 00:00:00 00', 'CY', 'BEL-1', 'PENDING_CALLCENTER', 0, 0),
('2021-04-13 00:00:00 00', 'CY', 'BEL-1', 'PENDING_CALLCENTER', 1, 0),
('2021-04-14 00:00:00 00', 'CY', 'ESP-1', 'PENDING_CALLCENTER', 0, 0),
('2021-04-14 00:00:00 00', 'CY', 'ESP-1', 'PENDING_CALLCENTER', 2, 0),
('2021-04-15 00:00:00 00', 'CY', 'SWE-1', 'PENDING_CALLCENTER', 3, 0),
('2021-04-16 00:00:00 00', 'CY', 'USA-1', 'PENDING_CALLCENTER', 14, 10),
('2021-04-17 00:00:00 00', 'CY', NULL, 'PENDING_CALLCENTER', 15, 0),
('2021-04-18 00:00:00 00', 'CY', 'BEL-1', 'REJECTED_CALLCENTER', 25, 0),
('2021-04-19 00:00:00 00', 'CY', 'ESP-1', 'REJECTED_CALLCENTER', 50, 0),
('2021-04-20 00:00:00 00', 'CY', 'SWE-1', 'REJECTED_CALLCENTER', 67, 0),
('2021-04-21 00:00:00 00', 'CY', 'USA-1', 'REJECTED_CALLCENTER', 90, 0)
;