Home > Software design >  PostgreSQL how to avoid duplicate rows with same date when adding a windows function and groupby
PostgreSQL how to avoid duplicate rows with same date when adding a windows function and groupby

Time:10-02

I'm trying to understand how to avoid PSQL duplicate rows when I'm applying a windows function and adding a groupBy.

I created a DBfiddle very simple but was unable to reproduce the issue there, however, is a starting point

enter image description here

Resulting in data removing the windows function and total from groupby

Result

CodePudding user response:

total in the GROUP BY clause causes the extra groups to be generated. To remove that we also need to refer to the SUM(total) aggregate in the select list and also adjust LAG to refer to the prior adjacent row. The - 0 operation is also not necessary.

Here's the updated fiddle with data which reproduces the extra row issue, plus the corrected SQL.

SELECT
    "candidates"."day" AS "day",
    "candidates"."study_id" AS "studyId",
    "candidates"."status" AS "status",
    sum("current") AS "current",
    sum("total") AS "total",
    COALESCE(SUM(total) - LAG(SUM(total)) OVER (ORDER BY day), 0) AS difference
FROM
    "candidates"
WHERE
    "study_id" in('CY')
    AND "status" in('PENDING_CALLCENTER')
GROUP BY
    "day",
    "study_id",
    "status"
ORDER BY
    "day" ASC,
    "study_id" ASC,
    "status" ASC
;

Result prior to the correction:

day studyId status current total difference
2021-04-13 01:00:00 01 CY PENDING_CALLCENTER 0 0 0
2021-04-13 01:00:00 01 CY PENDING_CALLCENTER 0 1 0
2021-04-14 01:00:00 01 CY PENDING_CALLCENTER 0 0 0
2021-04-14 01:00:00 01 CY PENDING_CALLCENTER 0 2 1
2021-04-15 01:00:00 01 CY PENDING_CALLCENTER 0 3 3
2021-04-16 01:00:00 01 CY PENDING_CALLCENTER 10 14 12
2021-04-17 01:00:00 01 CY PENDING_CALLCENTER 0 15 12

Result after the correction:

day studyId status current total difference
2021-04-13 01:00:00 01 CY PENDING_CALLCENTER 0 1 0
2021-04-14 01:00:00 01 CY PENDING_CALLCENTER 0 2 1
2021-04-15 01:00:00 01 CY PENDING_CALLCENTER 0 3 1
2021-04-16 01:00:00 01 CY PENDING_CALLCENTER 10 14 11
2021-04-17 01:00:00 01 CY PENDING_CALLCENTER 0 15 1

The setup:

CREATE TABLE "public"."candidates" (
    "day" timestamptz NOT NULL DEFAULT CURRENT_TIMESTAMP,
    "study_id" varchar(6) NOT NULL,
    "site_id" varchar(32),
    "status" varchar(32) NOT NULL,
    "total" int4 NOT NULL,
    "current" int4 NOT NULL
);

INSERT INTO "public"."candidates" ("day", "study_id", "site_id", "status", "total", "current") VALUES
('2021-04-13 00:00:00 00', 'CY', 'BEL-1', 'PENDING_CALLCENTER', 0, 0),
('2021-04-13 00:00:00 00', 'CY', 'BEL-1', 'PENDING_CALLCENTER', 1, 0),
('2021-04-14 00:00:00 00', 'CY', 'ESP-1', 'PENDING_CALLCENTER', 0, 0),
('2021-04-14 00:00:00 00', 'CY', 'ESP-1', 'PENDING_CALLCENTER', 2, 0),
('2021-04-15 00:00:00 00', 'CY', 'SWE-1', 'PENDING_CALLCENTER', 3, 0),
('2021-04-16 00:00:00 00', 'CY', 'USA-1', 'PENDING_CALLCENTER', 14, 10),
('2021-04-17 00:00:00 00', 'CY', NULL, 'PENDING_CALLCENTER', 15, 0),
('2021-04-18 00:00:00 00', 'CY', 'BEL-1', 'REJECTED_CALLCENTER', 25, 0),
('2021-04-19 00:00:00 00', 'CY', 'ESP-1', 'REJECTED_CALLCENTER', 50, 0),
('2021-04-20 00:00:00 00', 'CY', 'SWE-1', 'REJECTED_CALLCENTER', 67, 0),
('2021-04-21 00:00:00 00', 'CY', 'USA-1', 'REJECTED_CALLCENTER', 90, 0)
;
  • Related