Home > Software design >  Count duplicate records using with a common table expression
Count duplicate records using with a common table expression

Time:04-20

The data returned as dataset in the CTE below looks like:

|date       | rows_added |
|-----------|------------|
|2022-04-18 | 100        |
|-----------|------------|
|2022-04-17 | 200        |
|-----------|------------|
|2022-04-17 | 600        |
|-----------|------------

How can I incorporate a count of the duplicate records, by date, in the following CTE?

with dataset as (
  SELECT 
      date,
      COUNT(*) as rows_added
  FROM
      my_table
  WHERE 
      date between '2022-01-01 00:00:00'
      AND '2022-04-18 00:00:00'
  GROUP BY 
      date
 )
SELECT
    COUNT(*) as total_days_in_result_set,
    COUNT(DISTINCT rows_added) as total_days_w_distinct_record_counts,
    COUNT(*) - COUNT(DISTINCT rows_added) as toal_days_w_duplicate_record_counts,
FROM dataset

If I was going to only count the duplicate dates I would use the following but I can't incorporate it into the CTE above:

SELECT date, COUNT(date)
FROM my_table 
GROUP BY date 
HAVING COUNT(date) >1 

Desired output given the example above:

total_days_in_result_set | total_days_w_distinct_record_counts | toal_days_w_duplicate_record_counts | duplicate_dates |
----------------------------------------------------------------------------------------------------------------------------
3                        | 3                                   | 0 | 2           

CodePudding user response:

Here's my solution:

WITH DATA AS (with dataset as (
  SELECT 
      CAST(date as date),
      COUNT(*) as rows_added
  FROM
      my_table
  WHERE 
      date between '2022-04-17 00:00:00'
      AND '2022-04-18 00:00:00'
  GROUP BY 
      date
 )
SELECT
    COUNT(*) as total_days_in_result_set,
    COUNT(DISTINCT rows_added) as total_days_w_distinct_record_counts,
    COUNT(*) - COUNT(DISTINCT rows_added) as toal_days_w_duplicate_record_counts,
    CASE WHEN COUNT(date) > 1 THEN 'YES' ELSE 'NO' END AS duplicate_dates
    FROM dataset
)
SELECT 
    total_days_in_result_set,
    total_days_w_distinct_record_counts,
    toal_days_w_duplicate_record_counts,
    COUNT(*) FILTER(WHERE duplicate_dates = 'YES') as count_of_duplicate_dates
    
FROM DATA 
GROUP BY
    total_days_in_result_set,
    total_days_w_distinct_record_counts,
    toal_days_w_duplicate_record_counts

CodePudding user response:

I think if you use the filter in your final query, you can achieve that:

SELECT
    COUNT(*) as total_days_in_result_set,
    COUNT(DISTINCT rows_added) as total_days_w_distinct_record_counts,
    COUNT(*) - COUNT(DISTINCT rows_added) as toal_days_w_duplicate_record_counts,
    count (*) filter (where rows_added > 1) as duplicate_Dates
FROM dataset

The final field called "duplicate dates" is the example.

  • Related