Finding the average when values are missing using SQL

Time:10-05

I'm using Presto but any flavor of SQL will do.

I have a table in the following format:

Group_id  event_id  month  party     time_interval
1         1         Jan    Player A  1 hour
1         1         Jan    Player A  2 hours
1         1         Jan    Player B  1 hour
1         1         Jan    Player B  1 hour
1         2         Jan    Player A  3 hours

I need to get the average per group_id, per month, per party.

Here's how my average should be calculated:

(total number of hours per group, per month, per party) / (total number of distinct events per group, per month)

Here's the output I should be expecting for clarity's sake:

Group_id  month  party     avg_time_interval
1         Jan    Player A  3 hours
1         Jan    Player B  1 hour

Now here's the tricky part. For the first row everything makes perfect sense. We have 6 hours across both events, which we divide by 2 distinct events and get an average of 3.

However, for the second row we should get 1 hour instead of 2. Player B has no time recorded for event 2, and a missing entry should be treated as an interval of 0. There are still 2 distinct events for that group_id and month, so the 2 hours in total should be divided by 2, not by 1.
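As a sanity check of the arithmetic described above, here is a minimal Python sketch of the intended calculation. It is not Presto code, and it assumes time_interval is stored as a plain number of hours; the variable names are illustrative only.

```python
from collections import defaultdict

# (group_id, event_id, month, party, hours): the sample rows from the question,
# with time_interval assumed to be a plain number of hours.
rows = [
    (1, 1, "Jan", "Player A", 1),
    (1, 1, "Jan", "Player A", 2),
    (1, 1, "Jan", "Player B", 1),
    (1, 1, "Jan", "Player B", 1),
    (1, 2, "Jan", "Player A", 3),
]

total_hours = defaultdict(int)  # keyed by (group_id, month, party)
events = defaultdict(set)       # keyed by (group_id, month), regardless of party

for group_id, event_id, month, party, hours in rows:
    total_hours[(group_id, month, party)] += hours
    events[(group_id, month)].add(event_id)

# Divide each party's total by the number of distinct events in its group and month,
# so an event with no row for that party still counts as 0 hours.
averages = {
    key: hours / len(events[key[:2]])
    for key, hours in total_hours.items()
}

for key, avg in sorted(averages.items()):
    print(key, avg)
# (1, 'Jan', 'Player A') 3.0
# (1, 'Jan', 'Player B') 1.0
```

The important detail is that the denominator is looked up by (group_id, month) only, so it does not shrink when a party is missing from an event.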

The missing data makes this far more complicated than it should be. Otherwise, I believe the following would have solved it:

 SELECT Group_id , month, party, total/num_cases FROM(
 SELECT Group_id , month, party, SUM(time_interval) AS total, COUNT(DISTINCT(event_id)) AS num_cases
 FROM table
 GROUP BY Group_id , month, party
 )

CodePudding user response:

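select    distinct Group_id
         ,month
         ,party
         -- integer division truncates; multiply by 1.0 if fractional averages are needed
         ,total_hours_per_party/max(dns_rnk) over(partition by Group_id, month) as avg_time_interval
from     (
         select   Group_id
                 ,month
                 ,party
                 ,sum(time_interval) over(partition by Group_id, month, party)      as total_hours_per_party
                 -- the highest dense rank equals the number of distinct events in the group and month
                 ,dense_rank() over(partition by Group_id, month order by event_id) as dns_rnk
         from     t
         ) t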

Group_id  month  party     avg_time_interval
1         Jan    Player A  3
1         Jan    Player B  1
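
The sample output covers a single group and month. Below is a minimal sketch, using Python's built-in sqlite3 module rather than Presto, of how a window-function query behaves when a second group is present, assuming every window is partitioned by Group_id and month. The group 2 rows are hypothetical test data, not from the question, and time_interval is assumed to be numeric.

```python
import sqlite3

rows = [
    # sample rows from the question
    (1, 1, "Jan", "Player A", 1),
    (1, 1, "Jan", "Player A", 2),
    (1, 1, "Jan", "Player B", 1),
    (1, 1, "Jan", "Player B", 1),
    (1, 2, "Jan", "Player A", 3),
    # hypothetical second group with three distinct events
    (2, 10, "Jan", "Player A", 4),
    (2, 11, "Jan", "Player A", 2),
    (2, 12, "Jan", "Player B", 3),
]

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE t (Group_id INT, event_id INT, month TEXT, party TEXT, time_interval INT)"
)
conn.executemany("INSERT INTO t VALUES (?, ?, ?, ?, ?)", rows)

sql = """
SELECT DISTINCT Group_id, month, party,
       total_hours * 1.0
         / MAX(dns_rnk) OVER (PARTITION BY Group_id, month) AS avg_time_interval
FROM (
    SELECT Group_id, month, party,
           SUM(time_interval) OVER (PARTITION BY Group_id, month, party) AS total_hours,
           DENSE_RANK() OVER (PARTITION BY Group_id, month ORDER BY event_id) AS dns_rnk
    FROM t
) AS ranked
ORDER BY Group_id, party
"""

result = conn.execute(sql).fetchall()
for row in result:
    print(row)
```

Each group is divided by its own number of distinct events (2 for group 1, 3 for the hypothetical group 2), which would not be the case if the windows spanned the whole table.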


CodePudding user response:

You can find the count of distinct event_id values grouped by group_id and month, then join the result with your table, as follows:

SELECT T.Group_id, T.month, T.party
       ,SUM(T.time_interval)*1.0/ MAX(D.eid) AS avg_time_interval 
FROM tbl T
JOIN 
  (
    SELECT Group_id, month,
           COUNT(DISTINCT event_id) AS eid
    FROM tbl GROUP BY Group_id, month
  ) D
ON T.Group_id=D.Group_id AND
   T.month=D.month
GROUP BY T.Group_id,T.month,T.party
ORDER BY T.Group_id,T.month,T.party
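
To see the difference the join makes, here is a minimal sketch that runs both the question's single GROUP BY approach and the join approach on the sample rows. It uses Python's built-in sqlite3 module instead of Presto and assumes time_interval is stored as a plain number of hours; the `* 1.0` in the first query is added only to avoid integer division.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE tbl (Group_id INT, event_id INT, month TEXT, party TEXT, time_interval INT)"
)
conn.executemany(
    "INSERT INTO tbl VALUES (?, ?, ?, ?, ?)",
    [
        (1, 1, "Jan", "Player A", 1),
        (1, 1, "Jan", "Player A", 2),
        (1, 1, "Jan", "Player B", 1),
        (1, 1, "Jan", "Player B", 1),
        (1, 2, "Jan", "Player A", 3),
    ],
)

# Counts distinct events per party, so Player B's missing event is ignored.
naive_sql = """
SELECT Group_id, month, party,
       SUM(time_interval) * 1.0 / COUNT(DISTINCT event_id) AS avg_time_interval
FROM tbl
GROUP BY Group_id, month, party
ORDER BY Group_id, month, party
"""

# Counts distinct events per group and month, then joins that count back in.
join_sql = """
SELECT T.Group_id, T.month, T.party,
       SUM(T.time_interval) * 1.0 / MAX(D.eid) AS avg_time_interval
FROM tbl T
JOIN (
    SELECT Group_id, month, COUNT(DISTINCT event_id) AS eid
    FROM tbl
    GROUP BY Group_id, month
) D
  ON T.Group_id = D.Group_id AND T.month = D.month
GROUP BY T.Group_id, T.month, T.party
ORDER BY T.Group_id, T.month, T.party
"""

naive = conn.execute(naive_sql).fetchall()
joined = conn.execute(join_sql).fetchall()

print(naive)   # [(1, 'Jan', 'Player A', 3.0), (1, 'Jan', 'Player B', 2.0)]
print(joined)  # [(1, 'Jan', 'Player A', 3.0), (1, 'Jan', 'Player B', 1.0)]
```

Only the Player B row differs: the join divides by the 2 events that exist in the group and month, while the single GROUP BY divides by the 1 event Player B actually has rows for.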