I have a query that can be seen on this fiddle and I'm relatively new to PSQL, having a few months of experience.
The link to the third party is to show the result of it otherwise would be difficult to visualize it here.
WITH statuses_flow AS (
SELECT
c.id,
c.study_id,
c.site_id,
s.type AS status_type,
s.timestamp AS status_from,
sa.type,
row_number() OVER (ORDER BY s.candidate_id,
s.timestamp) AS row_no
FROM
public.candidates c
JOIN public.statuses s ON s.candidate_id = c.id
JOIN public.statuses sa ON sa.candidate_id = c.id
AND sa.id in(
SELECT
max(id)
FROM public.statuses
GROUP BY
candidate_id)
WHERE
c.study_id in('INIT1')
AND c.site_id in('Test1')
AND sa.type != 'ANONYMISED' ORDER BY
row_no ASC
)
SELECT
statuses_flow.id, statuses_flow.study_id AS "studyId", statuses_flow.site_id AS "siteId", statuses_flow.status_type AS "statusType", statuses_flow.status_from AS "statusFrom", next_status.status_from AS "statusTo", CASE WHEN next_status.status_from IS NULL THEN
NULL
ELSE
(
SELECT
created_at AS first_contact
FROM
public.activities
WHERE
candidate_id = statuses_flow.id
AND TYPE in('PHONE', 'SMS', 'EMAIL')
AND created_at BETWEEN statuses_flow.status_from
AND next_status.status_from ORDER BY
created_at FETCH FIRST 1 ROWS ONLY)
END AS "first_contact"
FROM
statuses_flow
LEFT JOIN statuses_flow next_status ON statuses_flow.id = next_status.id
AND statuses_flow.row_no 1 = next_status.row_no
WHERE
statuses_flow.status_type in('PENDING_SITE', 'PENDING_CALLCENTER', 'INCOMPLETE', 'REJECTED_CALLCENTER', 'REJECTED_SITE', 'CONSENTED')
ORDER BY
statuses_flow.id,
statuses_flow.status_from
This query collects and returns some metrics joining 3 tables about candidates being in a specific status during a medical study process.
The query in the real life is very slow and needs some adjustments to be written better.
The indexes are already applied on the original DB but still would be useful to find a better way.
The section of the query which I'm not convinced at this point is lines 29 to 42 from the DB fiddle link above.
I was wondering if by any chance could be changed to use a left join or another method to make a better performance.
It is returning per every candidate something as for example for one candidate the table below
id | studyId | siteId | statusType | statusFrom | statusTo | first_contact |
---|---|---|---|---|---|---|
1 | Study1 | Site1 | INCOMPLETE | 2021-07-20 09:30:52.101055 00 | 2021-07-20 09:31:53.568346 00 | NULL |
1 | Study1 | Site1 | PENDING_CALLCENTER | 2021-07-20 09:31:53.568346 00 | 2021-07-20 09:35:34.171876 00 | 2021-07-20 09:31:55.849 00 |
1 | Study1 | Site1 | PENDING_SITE | 2021-07-20 09:35:34.171876 00 | 2021-07-20 09:52:42.185163 00 | 2021-07-20 09:35:56.642 00 |
1 | Study1 | Site1 | REJECTED_SITE | 2021-07-20 09:53:08.874271 00 | NULL | NULL |
To explain, the same candidate, in a range of time can be in different statuses and for example, Candidate 1 (for simplicity C1) in the table above, was in 4 statuses.
C1 stayed in incomplete status from/to and on this status period was not contacted so that why we have null on last column first_contact.
C1 was contacted in the next 2 statuses and we determine the first_contact checking from the activities table the minimum activity timestamps of the types PHONE, EMAIL, and SMS in the range of time C1 was in that specific status from/to. The first_contact is the first time a candidate was contacted by EMAIL PHONE or SMS.
On the last status, we have statusTo = null
so also firstContact
is null.
The result I'm getting now is correct but seems not efficient to me and probably can get better but have no confidence in the way to manage the change in this situation.
CodePudding user response:
First using LEAD to eliminate an extra join. Plus a cte to precompute max id can probably also help
WITH sa as(
select candidate_id, type
from (
SELECT candidate_id, type, row_number() OVER (partition by candidate_id ORDER BY id desc) rn
FROM public.statuses
) t
where rn = 1 and type != 'ANONYMISED'
), statuses_flow AS (
SELECT
c.id,
c.study_id,
c.site_id,
s.type AS status_type,
s.timestamp AS status_from,
sa.type,
lead(s.timestamp) OVER (partition by c.id ORDER BY s.candidate_id,
s.timestamp) AS statusTo
FROM
public.candidates c
JOIN public.statuses s ON s.candidate_id = c.id
JOIN sa ON sa.candidate_id = c.id
WHERE
c.study_id in('INIT1')
AND c.site_id in('Test1')
AND sa.type != 'ANONYMISED'
)
SELECT
statuses_flow.id, statuses_flow.study_id AS "studyId",
statuses_flow.site_id AS "siteId", statuses_flow.status_type AS "statusType",
statuses_flow.status_from AS "statusFrom",
statuses_flow.statusTo,
CASE WHEN statuses_flow.statusTo IS NULL THEN
NULL
ELSE
(
SELECT
created_at AS first_contact
FROM
public.activities
WHERE
candidate_id = statuses_flow.id
AND TYPE in('PHONE', 'SMS', 'EMAIL')
AND created_at BETWEEN statuses_flow.status_from
AND statuses_flow.statusTo ORDER BY
created_at FETCH FIRST 1 ROWS ONLY)
END AS "first_contact"
FROM
statuses_flow
WHERE
statuses_flow.status_type in('PENDING_SITE', 'PENDING_CALLCENTER', 'INCOMPLETE', 'REJECTED_CALLCENTER', 'REJECTED_SITE', 'CONSENTED')
ORDER BY
statuses_flow.id,
statuses_flow.status_from
db<>fiddle , with comparing the results.