Home > Mobile >  SQL: How to use subqueries to remove duplicates based on multiple conditions
SQL: How to use subqueries to remove duplicates based on multiple conditions

Time:12-13

First I want to remove duplicates by selecting the last date and then remove the remaining duplicates by selecting the role 'BLUE'

Example Table:

ID STATUS ROLE DATE
1 ACTIVE BLUE Oct 20 2022
1 ACTIVE RED Dec 20 2022
2 ACTIVE BLUE Feb 02 2022
2 ACTIVE RED Feb 02 2022
3 INACTIVE BLUE Dec 03 2022
4 ACTIVE RED Dec 04 2022

Expected result:

ID STATUS ROLE DATE
1 ACTIVE RED Dec 20 2022
2 ACTIVE BLUE Feb 02 2022
3 INACTIVE BLUE Dec 03 2022
4 ACTIVE RED Dec 04 2022

This is what I have so far:

SELECT a.ID,
       a.STATUS,
       a.ROLE,
       a.DATE
FROM
(
    SELECT ID, Max(DATE) as MaxDate
    FROM WorkersTest
    GROUP BY ID
) b
INNER JOIN WorkersTest as a
ON a.ID = b.ID
AND a.DATE = b.MaxDate
ORDER BY b."ID"

Then as you can see I still need to add the second filter/subquery...

CodePudding user response:

First we can use following subquery to get each id with its latest date:

SELECT id, MAX(date) AS maxDate
FROM yourtable
GROUP BY id;

This can basically be used in the whole query like this:

SELECT y.id, y.status, y.role, 
FORMAT(y.date, 'MMM dd yyyy') AS date
FROM yourtable y
JOIN
(SELECT id, MAX(date) AS maxDate
FROM yourtable
GROUP BY id) grouped
ON y.id = grouped.id
AND y.date = grouped.maxDate
ORDER BY y.id;

But this will list both "blue" and "red" role in case they both have the same latest date.

Therefore, the result would be incorrect:

ID STATUS ROLE DATE
1 ACTIVE RED Dec 20 2022
2 ACTIVE BLUE Feb 02 2022
2 ACTIVE RED Feb 02 2022
3 INACTIVE BLUE Dec 03 2022
4 ACTIVE RED Dec 04 2022

So, to also satisfy the condition to only show the "blue" row in this case, there are different options. One of them would be to use a further subquery with a window function as for example ROW_NUMBER.

This could become necessary if there are further roles.

In our specific case with two roles only, we don't need this, but can use MIN instead because "blue" appears before "red" (if we would like to get "red" rows instead, we would use MAX).

So the query is now this one:

SELECT y.id, y.status, 
MIN(y.role) AS role, 
FORMAT(y.date, 'MMM dd yyyy') AS date
FROM yourtable y
JOIN
(SELECT id, MAX(date) AS maxDate
FROM yourtable
GROUP BY id) grouped
ON y.id = grouped.id
AND y.date = grouped.maxDate
GROUP BY y.id, y.status, y.date
ORDER BY y.id;

This will produce the correct result:

ID STATUS ROLE DATE
1 ACTIVE RED Dec 20 2022
2 ACTIVE BLUE Feb 02 2022
3 INACTIVE BLUE Dec 03 2022
4 ACTIVE RED Dec 04 2022

We can replicate this here: db<>fiddle

A general hint: If possible, we should avoid to use SQL key words as column name or table name (here "role" and "date").

Especially the name "date" is also not meaningful because it misses to tell us which kind of date. We should therefore prefer clear names like for example "sellDate" or "quittingDate".

  • Related