SQL: How to I create another "maximum" categorical value column based on multiple partitio-CodePudding

I have this table (table_2), which basically provides patient info and the department they visited on specific dates -

Patient_ID	department	service_dte	birth_dte	count
12345	cardiac	1/2/21	6/18/78	5
12345	cardiac	8/20/21	6/18/78	5
12345	cardiac	10/28/21	6/18/78	5
12345	ortho	1/14/21	6/18/78	2
12345	ortho	7/7/21	6/18/78	2
12345	cardiac	8/20/21	6/18/78	5
12345	cardiac	4/19/21	6/18/78	5
12345	obgyn	4/1/21	6/18/78	1
78645	neuro	5/1/21	7/18/87	2
78645	neuro	7/7/21	7/18/87	2
78645	ent	7/7/21	7/18/87	1
32423	gastro	11/7/21	3/12/57	1

I would like my desired output to produce an additional column (max) that provides highest repeated "department" per "Patient_ID", while keeping all the columns as they are, similar to below table -

Patient_ID	department	service_dte	birth_dte	count	max
12345	cardiac	1/2/21	6/18/78	5	cardiac
12345	cardiac	8/20/21	6/18/78	5	cardiac
12345	cardiac	10/28/21	6/18/78	5	cardiac
12345	ortho	1/14/21	6/18/78	2	cardiac
12345	ortho	7/7/21	6/18/78	2	cardiac
12345	cardiac	8/20/21	6/18/78	5	cardiac
12345	cardiac	4/19/21	6/18/78	5	cardiac
12345	obgyn	4/1/21	6/18/78	1	cardiac
78645	neuro	5/1/21	7/18/87	2	neuro
78645	neuro	7/7/21	7/18/87	2	neuro
78645	ent	7/7/21	7/18/87	1	neuro
32423	gastro	11/7/21	3/12/57	1	gastro

I tried below code; that provided me the count of department a patient visited in the above table -

select *
     , count(department) OVER (PARTITION BY patient_id, department) AS count
FROM table_1

however, it is not giving me desired value for patient 12345's max column, which should be cardiac. Instead of cardiac, I am getting ortho. Please kindly assist. This is the code I am trying -

select *
   , max(department) OVER (PARTITION BY patient_id) AS max
FROM table_2

Please kindly assist with an optimal way I can get the maximum column and advise on what I am or might be doing wrong.

Thank you.

CodePudding user response：

Use also FIRST_VALUE() window function in a 2nd level of aggregation:

SELECT *, FIRST_VALUE(department) OVER (PARTITION BY patient_id ORDER BY count DESC) AS max
FROM (
  SELECT *, COUNT(*) OVER (PARTITION BY patient_id, department) AS count
  FROM table_1
) t

See the demo.

CodePudding user response：

count(department) OVER (PARTITION BY patient_id, department) would be calculated for each row in your results, simply counting the number of records (where department is not null, to be specific) for the same patient & department as this row. This is not what you wanted.

max(department) OVER (PARTITION BY patient_id) AS max would also be calculated for each row in your results, simply finding the maximum of department name for the same patient as this row. This is the last department when all the departments visited by this patient are alphabetically sorted (this is how MAX() function is applied to character values), and this is not what you wanted.

We want a subquery to return the most visited department by each patient, here is an example:

select 
      tbl1.*
    , MostVisited.department as MostVisitedDepartment
from 
    table_1 tbl1
    inner join
    (-- Take only the most visited department for each patient
        select *
        from (-- rank departments by number of times 
              -- visited for each patient; highest visited gets 1
              -- the result of a tie (the patient visited two  departments the
              -- same number of times) depends on how the rdbms executes it.
            select 
                VisitCnt.*
              , row_number() over (partition by VisitCnt.patient_id
                                   order by VisitCnt.TimesVisited desc) as Seq
            from (-- Count each department visited by each patient
                select 
                    patient_id, department, count(*) as TimesVisited
                from table_1
                group by patient_id, department
                 ) VisitCnt
            ) VisitCntRnked
        where VisitCntRnked.Seq=1
    ) MostVisited
    on MostVisited.Patient_ID=tbl1.Patient_ID

As to being 'optimal': it depends on what you mean by it : The table size, indexes, statistics on the table, your particular database product, and, if you want to run this query for the entire (or most of the) population of the table, or only a small subset. We generally expect the query optimiser to take that these account in executing the query. There are many exceptions but optimising the query needs much more information than what is available in this question.