How to handle null and text values when finding average in postgres?


I have a table that looks like the one below (only a sample is shown):

            date            |  memory   |      cpu       
----------------------------+-----------+----------------
 2022-10-06 02:03:30.102194 |        37 | -1
 2022-10-06 02:03:30.106863 |        37 | -1
 2022-10-06 02:03:30.110286 |        37 | -1
 2022-10-06 02:03:30.114442 |        37 | -1
 2022-10-06 02:03:30.118419 |        37 | -1
 2022-10-06 02:03:30.121983 | 31.519346 | 7.0, 7.0, 7.0
 2022-10-06 02:03:30.125608 | 31.519346 | 7.0, 7.0, 7.0
 2022-10-06 02:03:30.12989  | 31.519346 | 7.0, 7.0, 7.0
 2022-10-06 02:03:30.134142 | 31.519346 | 7.0, 7.0, 7.0
 2022-10-06 02:03:30.137253 | 31.519346 | 7.0, 7.0, 7.0
 2022-10-06 02:03:30.143885 | 31.542543 | 24.0
 2022-10-06 02:03:30.148455 | 31.542543 | 24.0
 2022-10-06 02:03:30.151906 | 30.906942 | 11.0
 2022-10-06 02:03:30.155327 | 35.317715 | 12.0, 6.0
 2022-10-06 02:03:30.161415 | 35.317715 | 12.0, 6.0
 2022-10-06 02:03:30.16701  | 35.317715 | 12.0, 6.0
 2022-10-06 02:03:30.170698 | 35.317715 | 12.0, 6.0
 2022-10-06 02:03:30.174565 | 17.155323 | 1.0
 2022-10-06 02:03:30.177928 | 17.155323 | 1.0
 2022-10-06 02:03:30.181603 | 17.155323 | 1.0
 2022-10-06 02:03:30.185589 | 32.224056 | 14.0
 2022-10-06 02:03:30.188934 | 32.224056 | 14.0
 2022-10-06 02:03:30.193468 |  7.368925 | 1.0
 2022-10-06 02:03:30.198035 |  7.368925 | 1.0
 2022-10-06 02:03:30.201166 | 31.726305 | 27.0
 2022-10-06 02:03:30.205302 |        -1 | 4.0, 3.0, 3.0
 2022-10-06 02:03:30.208874 |        -1 | 4.0, 3.0, 3.0
 2022-10-06 02:03:30.212153 |        -1 | 4.0, 3.0, 3.0
 2022-10-06 02:03:30.215573 |  18.09042 | 1.0
 2022-10-06 02:03:30.219527 |   41.8536 | 11.0, 1.0, 1.0
 2022-10-06 02:03:30.222841 |   41.8536 | 11.0, 1.0, 1.0
 2022-10-06 02:03:30.226778 | 27.492594 | 23.0
 2022-10-06 02:03:30.230332 |  8.857008 | 2.0
 2022-10-06 02:03:30.234988 |  8.857008 | 2.0
 2022-10-06 02:03:30.238463 |  8.857008 | 2.0
 2022-10-06 02:03:30.242749 | 20.523735 | 4.0, 2.0
 2022-10-06 02:03:30.246553 | 20.523735 | 4.0, 2.0
 2022-10-06 02:03:30.250473 | 20.523735 | 4.0, 2.0
 2022-10-06 02:03:30.254418 | 20.523735 | 4.0, 2.0
 2022-10-06 02:03:30.25883  | 20.523735 | 4.0, 2.0
 2022-10-06 02:03:30.26222  | 29.214113 | 13.0
 2022-10-06 02:03:30.265766 | 29.214113 | 13.0

As you can see, this table stores cpu and memory readings along with a date. The cpu column contains comma-separated values, so I split it into individual columns like so:

SQL

SELECT
  date,
  memory,
  split_part(cpu, ',', 1) AS cpu1,
  split_part(cpu, ',', 2) AS cpu2,
  split_part(cpu, ',', 3) AS cpu3,
  split_part(cpu, ',', 4) AS cpu4
FROM
  webhook_devicedata
where
  date between '2022-10-06 02:03:00'
  and '2022-10-06 02:04:00'
order by
  date;

SQL Output

            date            |  memory   | cpu1 | cpu2 | cpu3 | cpu4 
----------------------------+-----------+------+------+------+------
 2022-10-06 02:03:30.102194 |        37 | -1   |      |      | 
 2022-10-06 02:03:30.106863 |        37 | -1   |      |      | 
 2022-10-06 02:03:30.110286 |        37 | -1   |      |      | 
 2022-10-06 02:03:30.114442 |        37 | -1   |      |      | 
 2022-10-06 02:03:30.118419 |        37 | -1   |      |      | 
 2022-10-06 02:03:30.121983 | 31.519346 | 7.0  |  7.0 |  7.0 | 
 2022-10-06 02:03:30.125608 | 31.519346 | 7.0  |  7.0 |  7.0 | 
 2022-10-06 02:03:30.12989  | 31.519346 | 7.0  |  7.0 |  7.0 | 
 2022-10-06 02:03:30.134142 | 31.519346 | 7.0  |  7.0 |  7.0 | 
 2022-10-06 02:03:30.137253 | 31.519346 | 7.0  |  7.0 |  7.0 | 
 2022-10-06 02:03:30.143885 | 31.542543 | 24.0 |      |      | 
 2022-10-06 02:03:30.148455 | 31.542543 | 24.0 |      |      | 
 2022-10-06 02:03:30.151906 | 30.906942 | 11.0 |      |      | 
 2022-10-06 02:03:30.155327 | 35.317715 | 12.0 |  6.0 |      | 
 2022-10-06 02:03:30.161415 | 35.317715 | 12.0 |  6.0 |      | 
 2022-10-06 02:03:30.16701  | 35.317715 | 12.0 |  6.0 |      | 
 2022-10-06 02:03:30.170698 | 35.317715 | 12.0 |  6.0 |      | 
 2022-10-06 02:03:30.174565 | 17.155323 | 1.0  |      |      | 
 2022-10-06 02:03:30.177928 | 17.155323 | 1.0  |      |      | 
 2022-10-06 02:03:30.181603 | 17.155323 | 1.0  |      |      | 
 2022-10-06 02:03:30.185589 | 32.224056 | 14.0 |      |      | 
 2022-10-06 02:03:30.188934 | 32.224056 | 14.0 |      |      | 
 2022-10-06 02:03:30.193468 |  7.368925 | 1.0  |      |      | 
 2022-10-06 02:03:30.198035 |  7.368925 | 1.0  |      |      | 
 2022-10-06 02:03:30.201166 | 31.726305 | 27.0 |      |      | 
 2022-10-06 02:03:30.205302 |        -1 | 4.0  |  3.0 |  3.0 | 
 2022-10-06 02:03:30.208874 |        -1 | 4.0  |  3.0 |  3.0 | 
 2022-10-06 02:03:30.212153 |        -1 | 4.0  |  3.0 |  3.0 | 
 2022-10-06 02:03:30.215573 |  18.09042 | 1.0  |      |      | 
 2022-10-06 02:03:30.219527 |   41.8536 | 11.0 |  1.0 |  1.0 | 
 2022-10-06 02:03:30.222841 |   41.8536 | 11.0 |  1.0 |  1.0 | 
 2022-10-06 02:03:30.226778 | 27.492594 | 23.0 |      |      | 
 2022-10-06 02:03:30.230332 |  8.857008 | 2.0  |      |      | 
 2022-10-06 02:03:30.234988 |  8.857008 | 2.0  |      |      | 
 2022-10-06 02:03:30.238463 |  8.857008 | 2.0  |      |      | 
 2022-10-06 02:03:30.242749 | 20.523735 | 4.0  |  2.0 |      | 
 2022-10-06 02:03:30.246553 | 20.523735 | 4.0  |  2.0 |      | 
 2022-10-06 02:03:30.250473 | 20.523735 | 4.0  |  2.0 |      | 
 2022-10-06 02:03:30.254418 | 20.523735 | 4.0  |  2.0 |      | 
 2022-10-06 02:03:30.25883  | 20.523735 | 4.0  |  2.0 |      | 
 2022-10-06 02:03:30.26222  | 29.214113 | 13.0 |      |      | 
 2022-10-06 02:03:30.265766 | 29.214113 | 13.0 |      |      | 

Now I want to group by date and get the average of memory and of the individual cpu columns on an hourly basis. If you look carefully, the cpu1 column has a value in every row, so the query below works:

SQL

SELECT
  date_trunc('hour', date) as hourly,
  avg(COALESCE(split_part(cpu, ',', 1) :: text) :: float) AS cpu1
FROM
  webhook_devicedata
where
  date between '2022-10-06 02:03:00'
  and '2022-10-06 02:04:00'
group by
  hourly;

I get the result

       hourly        |       cpu1        
---------------------+-------------------
 2022-10-06 02:00:00 | 7.309523809523809

But as soon as I include the other cpu columns, I get an error:

SQL

SELECT
  date_trunc('hour', date) as hourly,
  avg(COALESCE(split_part(cpu, ',', 1) :: text) :: float) AS cpu1,
  avg(COALESCE(split_part(cpu, ',', 2) :: text) :: float) AS cpu2
FROM
  webhook_devicedata
where
  date between '2022-10-06 02:03:00'
  and '2022-10-06 02:04:00'
group by
  hourly;

Error

ERROR:  invalid input syntax for type double precision: ""
CONTEXT:  parallel worker
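
For context: split_part() returns an empty string when the requested field does not exist, so it is the cast of '' to float that fails. A minimal reproduction:

SQL

SELECT split_part('24.0', ',', 2);  -- returns '' (there is no second field)
SELECT '' :: float;                 -- ERROR:  invalid input syntax for type double precision: ""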

I even tried to get rid of nulls and empty strings by writing the query like this:

SQL

SELECT
  date_trunc('hour', sub2.date) as hourly,
  avg(sub2.cpu1) as avg_cpu1,
  avg(sub2.cpu2) as avg_cpu2
from
  (
    select
      date,
      memory,
      split_part(cpu, ',', 1) :: text as cpu1,
      split_part(cpu, ',', 2) :: text as cpu2
    FROM
      (
        select
          date,
          memory,
          cpu
        from
          webhook_devicedata
        where
          date between '2022-10-06 02:03:00'
          and '2022-10-06 02:04:00'
          and cpu != ''
          and cpu is not null
      ) as sub
  ) as sub2
group by
  hourly;

But then I get the error below:

Error

ERROR:  function avg(text) does not exist
LINE 1: SELECT date_trunc('hour', sub2.date) as hourly, avg(sub2.cpu...
                                                        ^
HINT:  No function matches the given name and argument types. You might need to add explicit type casts.

What am I doing wrong? How can I get the average of all the columns?

CodePudding user response:

You need to check for empty strings with a CASE expression before casting, for example:

SELECT
  date_trunc('hour', date) as hourly,
  avg(case when split_part(cpu, ',', 1) = '' then 0 else split_part(cpu, ',', 1):: float end) AS cpu1,
  avg(case when split_part(cpu, ',', 2) = '' then 0 else split_part(cpu, ',', 2):: float end) AS cpu2
FROM
  webhook_devicedata
where
  date between '2022-10-06 02:03:00'
  and '2022-10-06 02:04:00'
group by
  hourly;
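
Note that the CASE above counts a missing field as 0, which pulls the hourly average down. If you would rather have avg() simply ignore rows where a given cpu slot is absent, NULLIF can turn the empty string into NULL before the cast, since aggregates skip NULLs. A sketch of that variant, assuming the same webhook_devicedata table and a text cpu column as in the question:

SELECT
  date_trunc('hour', date) as hourly,
  -- NULLIF(x, '') yields NULL for empty fields, so avg() skips them instead of averaging in zeros
  avg(NULLIF(split_part(cpu, ',', 1), '') :: float) AS cpu1,
  avg(NULLIF(split_part(cpu, ',', 2), '') :: float) AS cpu2,
  avg(NULLIF(split_part(cpu, ',', 3), '') :: float) AS cpu3,
  avg(NULLIF(split_part(cpu, ',', 4), '') :: float) AS cpu4
FROM
  webhook_devicedata
where
  date between '2022-10-06 02:03:00'
  and '2022-10-06 02:04:00'
group by
  hourly;

Which treatment is right (0 or ignore) depends on what a missing reading means for your data.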