Home > Mobile >  Remove duplicates using multiple criteria in SQL
Remove duplicates using multiple criteria in SQL

Time:11-03

I've tried all afternoon to dedup a table that looks like this:

ID1 | ID2 | Date       | Time            |Status  | Price 
---- ----- ------------ ----------------- -------- -------
01  | A   | 01/01/2022 | 10:41:47.000000 | DDD    | 55
01  | B   | 02/01/2022 | 16:22:31.000000 | DDD    | 53
02  | C   | 01/01/2022 | 08:54:03.000000 | AAA    | 72
02  | D   | 03/01/2022 | 11:12:35.000000 | DDD    | 
03  | E   | 01/01/2022 | 17:15:41.000000 | DDD    | 67
03  | F   | 01/01/2022 | 19:27:22.000000 | DDD    | 69
03  | G   | 02/01/2022 | 06:45:52.000000 | DDD    | 78

Basically, I need to dedup based on two conditions:

  1. Status: where AAA > BBB > CCC > DDD. So, pick the highest one.
  2. When the Status is the same given the same ID1, pick the latest one based on Date and Time.

The final table should look like:

ID1 | ID2 | Date       | Time            |Status  | Price 
---- ----- ------------ ----------------- -------- -------
01  | B   | 02/01/2022 | 16:22:31.000000 | DDD    | 53
02  | C   | 01/01/2022 | 08:54:03.000000 | AAA    | 72
03  | G   | 02/01/2022 | 06:45:52.000000 | DDD    | 78

Is there a way to do this in Redshift SQL / PostgreSQL?

I tried variations of this, but everytime it doesn't work because it demands that I add all columns to the group by, so then it defeats the purpose

select a.id1,
        b.id2,
        b.date,
       b.time,
       b.status,
       b.price,
       case when (status = 'AAA') then 4
            when (status = 'BBB') then 3
            when (status= 'CCC') then 2
            when (status = 'DDD') then 1
            when (status = 'EEE') then 0
                else null end as row_order
from table1 a
left join table2 b
on a.id1=b.id1
group by id1
having row_order = max(row_order)
   and date=max(date)
   and time=max(time)

Any help at all is appreciated!

CodePudding user response:

Windowing functions are good at this:

SELECT ID1, ID2, Date, Time, Status, Price
FROM (
    SELECT *,
        row_number() OVER (PARTITION BY ID1 ORDER BY Status, Date DESC, Time DESC) rn
    FROM MyTable
) t
WHERE rn = 1

See it work here:

https://dbfiddle.uk/uAvDz1Qn

CodePudding user response:

You can use ROW_NUMBER() like so:

with cte as (
select a.id1,
       b.id2,
       b.date,
       b.time,
       b.status,
       b.price,
       ROW_NUMBER() OVER (PARTITION BY a.id1 ORDER BY b.status ASC, b.date DESC, b.time DESC) RN
from table1 a
left join table2 b on a.id1=b.id1
)

select * from cte where rn = 1

CodePudding user response:

This is a typical top-1-per-group problem. The canonical solution indeed involves window functions, as demonstrated by Joel Coehoorn and Aaron Dietz.

But Postgres has a specific extension, called distinct on, which is built exactly for the purpose of solving top-1-per-group problems. The syntax is neater, and you benefit built-in optimizations:

select distinct on (id1) t.*
from mytable t
order by id1, status, "Date" desc, "Time" desc

Here is a demo on DB Fiddle based on that of Joel Coehoorn.

  • Related