Calculate cumulative percentages by date in SQL-CodePudding

How might I calculate cumulative percentages in SQL (Postgres/Vertica)? For instance, the question is "As of each date, of all patients who had been diagnosed by that date, what percent had been treated by that date?"

For instance, this table shows dates of diagnosis and treatment, with binary values that might be summed

ID |  diagnosed | date_diag |  treated | date_treat
---|------------|-----------|----------|-----------
1      1            Jan 1         0        null
2      1            Jan 15        1        Feb 20
3      1            Jan 29        1        Feb 1
4      1            Feb 08        1        Mar 4             
5      1            Feb 12        0        null
6      1            Feb 18        1        Feb 24
7      1            Mar 15        1        May 5
8      1            Apr 14        1        Apr 20

I'd like to get a table of cumulative treated-vs-diagnosed ratio that might look like this.

date   | ytd_diag   | ytd_treat |  ytd_percent
-------|------------|-----------|----------
Jan 01        1        0           0.00
Jan 15        2        0           0.00
Jan 29        3        0           0.00
Feb 08        4        1           0.25
Feb 12        5        1           0.20
Feb 18        6        1           0.17
Mar 15        7        4           0.57
Apr 14        8        4           0.50

I can calculate cumulative counts of diagnosed or treated (e.g. below), using window functions but I can't figure out a SQL query to get the number of people who'd already been treated as of each diagnosis date.

SELECT
  date_diag ,
  SUM(COUNT(*)) OVER ( ORDER BY date_diag ) as freq
FROM patients
WHERE diagnosed = 1
GROUP BY date_diag
ORDER BY date_diag;

CodePudding user response：

You can use conditional aggregation with SUM() window function:

WITH cte AS (
  SELECT kind,
         date,
         SUM((kind = 1)::int) OVER (ORDER BY date) ytd_diag,
         SUM((kind = 2)::int) OVER (ORDER BY date) ytd_treat 
  FROM (
    SELECT 1 kind, date_diag date, diagnosed status FROM patients
    UNION ALL
    SELECT 2, date_treat, treated FROM patients WHERE date_treat IS NOT NULL
  ) t  
)  
SELECT date, ytd_diag, ytd_treat,
       ROUND(1.0 * ytd_treat / ytd_diag, 2) ytd_percent
FROM cte 
WHERE kind = 1;

See the demo.

CodePudding user response：

You can solve this with window functions. The first thing you want to do is to derive a table from your patients table that has a running tally of both the diagnosed and treated columns. The rows should be tallied in ascending order of the diagnosis date.

Here's how you do that.First I'll create a sample patients table and data (I'll only include the columns necessary):

create temporary table patients (
    date_diag date,
  diagnosed int default 0,
  treated int default 0
);

insert into patients (date_diag, diagnosed, treated) values
('2021-01-01', 1, 0),
('2021-01-11', 1, 1),
('2021-01-16', 1, 0),
('2021-01-30', 1, 1),
('2021-02-04', 1, 1),
('2021-01-14', 1, 1);

Then here's how to create the derived table of all the tallied results.

  select 
  date_diag,
  diagnosed,
  treated,
  sum(treated) over(order by date_diag ASC ) as treated_cmtv,
    count(diagnosed) over(order by date_diag ASC) as diagnosed_cmtv
  from patients

/*
 date_diag  | diagnosed | treated | treated_cmtv | diagnosed_cmtv
------------ ----------- --------- -------------- ----------------
 2021-01-01 |         1 |       0 |            0 |              1
 2021-01-11 |         1 |       1 |            1 |              2
 2021-01-14 |         1 |       1 |            2 |              3
 2021-01-16 |         1 |       0 |            2 |              4
 2021-01-30 |         1 |       1 |            3 |              5
 2021-02-04 |         1 |       1 |            4 |              6
*/

Now that you have this table you can easily calculate the percentage by using defining this derived table in a subquery and then selecting the necessary columns for the calculation. Like so:

select 
p.date_diag,
p.diagnosed,
p.diagnosed_cmtv,
p.treated_cmtv,
p.treated,
TRUNC(p.treated_cmtv::numeric / p.diagnosed_cmtv * 1.0, 2) as percent
from (
  -- same table as above
  select 
  date_diag,
  diagnosed,
  treated,
  sum(treated) over(order by date_diag ASC ) as treated_cmtv,
    count(diagnosed) over(order by date_diag ASC) as diagnosed_cmtv
  from patients
) as p;

/*
 date_diag  | diagnosed | diagnosed_cmtv | treated_cmtv | treated | percent
------------ ----------- ---------------- -------------- --------- ---------
 2021-01-01 |         1 |              1 |            0 |       0 |    0.00
 2021-01-11 |         1 |              2 |            1 |       1 |    0.50
 2021-01-14 |         1 |              3 |            2 |       1 |    0.66
 2021-01-16 |         1 |              4 |            2 |       0 |    0.50
 2021-01-30 |         1 |              5 |            3 |       1 |    0.60
 2021-02-04 |         1 |              6 |            4 |       1 |    0.66
*/

I think that gives you what you are asking for.

CodePudding user response：

An alternative approach to the other answers is to use a coordinated sub query in the select

SELECT 
 p.date_diag,
 (SELECT COUNT(*)
  FROM patients p2 
  WHERE p2.date_treat <= p.date_diag) ytd_treated
FROM
  patients p
WHERE diagnosed = 1
GROUP BY p.date_diag
ORDER BY p.date_diag

This will give you that column of 0,0,0,1,1,4,4 - you can divide it by the diagnosed column to give your percentage

SELECT
   (select ...) / SUM(COUNT(*)) OVER(...)

Note you might need some more clauses in your inner where, such as having a treated date greater than or equal to Jan 1st of the year of the diag date if you're running it against a dataset with more than just one year's data

Also bear in mind that treated as an integer will (should) nearly always be less than diagnosed so if you do an integer divide you'll get zero. Cast one of the operands to float or if you're doing your percentage out of a hundred maybe *100.0