Count distinct by boolean value

Time:02-21

Is there a better (prettier, more idiomatic, or even more performant) way to do the following?

Objective: count the distinct values of one column, split by the value of another boolean column.

Sample data:

id  | metadata_streaming_date | cols_exist |
--- | ----------------------- | -----------|
 1  | 2022-02-20              | true       |
 1  | 2022-02-20              | true       |
 2  | 2022-02-20              | true       |
 2  | 2022-02-20              | true       |
 3  | 2022-02-20              | false      |
 1  | 2022-02-19              | true       |
 2  | 2022-02-19              | false      |
 3  | 2022-02-19              | false      |
 4  | 2022-02-19              | false      |
 4  | 2022-02-19              | false      |

The expected result counts distinct id values per metadata_streaming_date in two columns: wanted (only rows where cols_exist = false) and overall (all rows for that date).

Result table expected:

| metadata_streaming_date | wanted | overall |
| ----------------------- | -------| --------|
| 2022-02-20              | 1      | 3       |
| 2022-02-19              | 3      | 4       |

I can achieve it with two sub-queries, inner-joining them on metadata_streaming_date:

select
  t1.metadata_streaming_date,
  overall,
  wanted,
  wanted / overall as perc
from
  (
    select
      metadata_streaming_date,
      count(distinct id) as overall
    from
      non_needed_fields_view
    where
      metadata_streaming_date >= '2022-02-19'
    group by
      metadata_streaming_date
  ) as t1
  inner join (
    select
      metadata_streaming_date,
      count(distinct id) as wanted
    from
      non_needed_fields_view
    where
      cols_exist is false
      and metadata_streaming_date >= '2022-02-19'
    group by
      metadata_streaming_date
  ) as t2 on t1.metadata_streaming_date = t2.metadata_streaming_date

CodePudding user response:

You can use conditional aggregation with DISTINCT: put your condition in a CASE WHEN expression inside the aggregate. Rows that fail the condition yield NULL, and COUNT ignores NULLs.

SELECT metadata_streaming_date,
       COUNT(DISTINCT CASE WHEN cols_exist = false THEN id END) wanted ,
       COUNT(DISTINCT id) overall 
FROM non_needed_fields_view
WHERE metadata_streaming_date >= '2022-02-19'
GROUP BY metadata_streaming_date 
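
To check the query above against the sample data from the question, here is a minimal sketch using Python's built-in sqlite3 module. It is an illustration under assumptions, not the asker's setup: the view is replaced by an in-memory table of the same name, cols_exist is stored as 0/1 because SQLite has no boolean type, and an ORDER BY is added only to make the output deterministic. The helper name count_by_date is hypothetical.

```python
import sqlite3

# Sample rows from the question: (id, metadata_streaming_date, cols_exist).
# Assumption: cols_exist is stored as 0/1, since SQLite has no boolean type.
ROWS = [
    (1, "2022-02-20", 1),
    (1, "2022-02-20", 1),
    (2, "2022-02-20", 1),
    (2, "2022-02-20", 1),
    (3, "2022-02-20", 0),
    (1, "2022-02-19", 1),
    (2, "2022-02-19", 0),
    (3, "2022-02-19", 0),
    (4, "2022-02-19", 0),
    (4, "2022-02-19", 0),
]

# Same shape as the answer's query; "cols_exist = 0" stands in for "= false".
QUERY = """
SELECT metadata_streaming_date,
       COUNT(DISTINCT CASE WHEN cols_exist = 0 THEN id END) AS wanted,
       COUNT(DISTINCT id) AS overall
FROM non_needed_fields_view
WHERE metadata_streaming_date >= '2022-02-19'
GROUP BY metadata_streaming_date
ORDER BY metadata_streaming_date
"""


def count_by_date(rows):
    """Load rows into an in-memory table and run the conditional-count query."""
    conn = sqlite3.connect(":memory:")
    try:
        conn.execute(
            "CREATE TABLE non_needed_fields_view "
            "(id INTEGER, metadata_streaming_date TEXT, cols_exist INTEGER)"
        )
        conn.executemany(
            "INSERT INTO non_needed_fields_view VALUES (?, ?, ?)", rows
        )
        return conn.execute(QUERY).fetchall()
    finally:
        conn.close()


if __name__ == "__main__":
    for row in count_by_date(ROWS):
        print(row)
```

The rows returned match the expected result table in the question: 3 of 4 for 2022-02-19 and 1 of 3 for 2022-02-20.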

CodePudding user response:

  1. There is a FILTER clause for aggregate functions, currently supported by several RDBMS / SQL engines including Spark SQL, PostgreSQL and SQLite. As far as I remember, it is part of the ISO SQL standard.
  2. The ISO syntax for a date literal in SQL is DATE 'yyyy-MM-dd'.

select   metadata_streaming_date 
        ,count(distinct id) filter (where cols_exist = false) as wanted
        ,count(distinct id)                                   as overall
from     non_needed_fields_view
where    metadata_streaming_date >= date '2022-02-19'
group by metadata_streaming_date 

+-----------------------+------+-------+
|metadata_streaming_date|wanted|overall|
+-----------------------+------+-------+
|             2022-02-19|     3|      4|
|             2022-02-20|     1|      3|
+-----------------------+------+-------+
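
The question's original query also returned wanted / overall as perc, which neither single-pass answer includes. A hedged sketch of adding it is below, again using Python's sqlite3 with an in-memory stand-in table. It uses the CASE WHEN form so it runs on older SQLite builds, and cols_exist is stored as 0/1. The 1.0 * factor is an assumption about the target engine: Spark SQL's / already returns a fractional result, but engines with integer division (SQLite, PostgreSQL) would otherwise truncate the ratio to 0. The names SAMPLE_ROWS, PERC_QUERY and ratio_by_date are hypothetical.

```python
import sqlite3

# Sample rows from the question: (id, metadata_streaming_date, cols_exist as 0/1).
SAMPLE_ROWS = [
    (1, "2022-02-20", 1), (1, "2022-02-20", 1),
    (2, "2022-02-20", 1), (2, "2022-02-20", 1),
    (3, "2022-02-20", 0),
    (1, "2022-02-19", 1), (2, "2022-02-19", 0),
    (3, "2022-02-19", 0),
    (4, "2022-02-19", 0), (4, "2022-02-19", 0),
]

# Single pass over the table; 1.0 * forces floating-point division.
PERC_QUERY = """
SELECT metadata_streaming_date,
       COUNT(DISTINCT CASE WHEN cols_exist = 0 THEN id END) AS wanted,
       COUNT(DISTINCT id) AS overall,
       1.0 * COUNT(DISTINCT CASE WHEN cols_exist = 0 THEN id END)
           / COUNT(DISTINCT id) AS perc
FROM non_needed_fields_view
WHERE metadata_streaming_date >= '2022-02-19'
GROUP BY metadata_streaming_date
ORDER BY metadata_streaming_date
"""


def ratio_by_date(rows):
    """Return (date, wanted, overall, perc) tuples for the given rows."""
    conn = sqlite3.connect(":memory:")
    try:
        conn.execute(
            "CREATE TABLE non_needed_fields_view "
            "(id INTEGER, metadata_streaming_date TEXT, cols_exist INTEGER)"
        )
        conn.executemany(
            "INSERT INTO non_needed_fields_view VALUES (?, ?, ?)", rows
        )
        return conn.execute(PERC_QUERY).fetchall()
    finally:
        conn.close()


if __name__ == "__main__":
    for date, wanted, overall, perc in ratio_by_date(SAMPLE_ROWS):
        print(date, wanted, overall, round(perc, 3))
```

One behavioural difference from the two-sub-query version is worth noting: with an inner join, a date on which no id has cols_exist = false disappears from the result, whereas the single-pass form keeps that date with wanted = 0.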