How to do conditional aggregate based on values of another column?-CodePudding

I have a table called device_data that looks like below

     Column     |            Type             | Collation | Nullable |                     Default                     
---------------- ----------------------------- ----------- ---------- -------------------------------------------------
 id             | integer                     |           |          | nextval('device_data_id_seq'::regclass)
 date           | timestamp without time zone |           |          | 
 packet_drop    | real                        |           |          | 
 jitter         | real                        |           |          | 
 latency        | real                        |           |          | 
 alert          | character varying(50)       |           |          |

It basically stores the packet drops, jitter, latency on a minute basis.

Now there is another column called alert that holds the value HIGH, MEDIUM and LOW based on a threshold and holds empty string if the threshold is not met.

Now I do an hourly average on the table, my query looks like below

select date_trunc('hour', date) as hourly, avg(jitter), avg(latency), avg(packet_drop) from device_data where date between '2022-11-26 17:41:11' and '2022-11-27 17:41:11' group by hourly order by hourly;

Output

       hourly        |        avg         |        avg         |         avg         
--------------------- -------------------- -------------------- ---------------------
 2022-11-26 17:00:00 |   3.52857138642243 |  2.771428568022592 |                   0
 2022-11-26 18:00:00 |  2.484615419346553 |  2.815384602546692 |                   0
 2022-11-26 19:00:00 |  2.218461540570626 |  2.723076921242934 |                   0
 2022-11-26 20:00:00 |  5.098461512992015 | 2.7076923021903405 |                   0
 2022-11-26 21:00:00 | 2.0060606116824076 | 2.6469696814363655 |                   0
 2022-11-26 22:00:00 |  5.672307815345434 |  2.810769222332881 |                   0
 2022-11-26 23:00:00 | 2.7828124976949766 |  2.893749985843897 |                   0
 2022-11-27 00:00:00 | 2.6046153992414474 | 2.8030769238105187 |                   0
 2022-11-27 01:00:00 |  3.846031717837803 | 2.8333333200878568 |                   0
 ...
(25 rows)

Next I do an hourly average on the table with additional column alert, so my query looks something like below

select date_trunc('hour', date) as hourly, avg(jitter), avg(latency), avg(packet_drop), alert from device_data where date between '2022-11-26 17:41:11' and '2022-11-27 17:41:11' group by hourly, alert order by hourly;

Output

       hourly        |        avg         |        avg         |         avg         | alert 
--------------------- -------------------- -------------------- --------------------- ----------------
 2022-11-26 17:00:00 | 1.2649999938905239 |  2.755000001192093 |                   0 | 
 2022-11-26 17:00:00 |  48.79999923706055 | 3.0999999046325684 |                   0 | MEDIUM
 2022-11-26 18:00:00 |  2.484615419346553 |  2.815384602546692 |                   0 | 
 2022-11-26 19:00:00 |  2.218461540570626 |  2.723076921242934 |                   0 | 
 2022-11-26 20:00:00 | 2.6603174461495307 |     2.695238092589 |                   0 | 
 2022-11-26 20:00:00 |  106.5999984741211 | 3.0999999046325684 |                   0 | HIGH
 2022-11-26 20:00:00 |  57.20000076293945 | 3.0999999046325684 |                   0 | MEDIUM
 2022-11-26 21:00:00 | 2.0060606116824076 | 2.6469696814363655 |                   0 | 
 2022-11-26 22:00:00 | 2.8349206649831364 |  2.793650784189739 |                   0 | 
 2022-11-26 22:00:00 |  95.05000305175781 |  3.350000023841858 |                   0 | HIGH
 2022-11-26 23:00:00 | 2.7828124976949766 |  2.893749985843897 |                   0 | 
 2022-11-27 00:00:00 |  2.132812526775524 |           2.796875 |                   0 | 
 2022-11-27 00:00:00 |  32.79999923706055 |  3.200000047683716 |                   0 | LOW
 2022-11-27 01:00:00 |  1.849999995962266 |  2.822580631702177 |                   0 | 
 2022-11-27 01:00:00 |  127.5999984741211 |                3.5 |                   0 | HIGH
 ...
(35 rows)

As you can see in my second query, some of the rows have duplicate timestamps like 2022-11-26 17:00:00, 2022-11-26 20:00:00 and so on.

I understand that when averaging out the columns, in the case of alert column, it encounters multiple values be it empty string along with some HIGH, MEDIUM or LOW strings or some other combinations resulting in multiple rows for the same timestamp.

Requirement

What I want is to simply average out the values (initially ignoring the alert column values) and then check for HIGH string in the alert column for that particular hour (I am doing hourly average). If the string is present, simply assign HIGH for that row in the alert column. If no HIGH string found, just assign empty string for that row in the alert column. Something like below

Expected output

       hourly        |        avg         |        avg         |         avg         | alert 
--------------------- -------------------- -------------------- --------------------- ----------------
 2022-11-26 17:00:00 |  3.52857138642243   |  2.771428568022592 |                   0 | 
 2022-11-26 18:00:00 |  2.484615419346553  |  2.815384602546692 |                   0 | 
 2022-11-26 19:00:00 |  2.218461540570626  |  2.723076921242934 |                   0 | 
 2022-11-26 20:00:00 |  5.098461512992015  | 2.7076923021903405 |                   0 |  HIGH
 2022-11-26 21:00:00 |  2.0060606116824076 | 2.6469696814363655 |                   0 | 
 2022-11-26 22:00:00 |  5.672307815345434  |  2.810769222332881 |                   0 |  HIGH
 2022-11-26 23:00:00 |  2.7828124976949766 |  2.893749985843897 |                   0 | 
 2022-11-27 00:00:00 |  2.6046153992414474 | 2.8030769238105187 |                   0 | 
 2022-11-27 01:00:00 |  3.846031717837803  | 2.8333333200878568 |                   0 |  HIGH
 ...
(25 rows)

I thought of only targeting rows containing HIGH value in alert column by doing something like where alert = 'HIGH', but then it will drop other rows and affect my average calculation.

How can I do this?

CodePudding user response：

You need to use an aggregate function for the column 'alert', as well. Since you want to only bring up a certain value, you can use a conditional expression inside the aggregate, like this:

select 
     date_trunc('hour', date) as hourly
   , avg(jitter)
   , avg(latency)
   , avg(packet_drop)
   , coalesce(MAX(case when alert='HIGH' then 'HIGH' end),'') as alert 
from device_data 
where date between '2022-11-26 17:41:11' and '2022-11-27 17:41:11' 
group by hourly order by hourly;

This should work, because the MAX function ignores the NULL values (all alert values other than HIGH will cause NULL result in the case expression.