Home > Enterprise >  SQL - Finding Duplicate Records based certain criteria
SQL - Finding Duplicate Records based certain criteria

Time:12-06

I have these records in the table - employee_projects

id employee_id project_id status
1 emp1 proj1 VERIFIED
2 emp2 proj2 REJECTED
3 emp1 proj1 VERIFIED
4 emp1 proj3 REJECTED
5 emp2 proj2 REQUIRED
6 emp3 proj4 SUBMITTED
7 emp4 proj5 VERIFIED
8 emp4 proj6 VERIFIED
9 emp3 proj4 REQUIRED

Here are the criteria for determining duplicates:

  1. Same employee ID, same project ID under the same status (Example: rows 1 and 3 are duplicates)
  2. Same employee ID, same project ID but in different status (Example: rows 6 and 9 are duplicates). An exception to duplication criteria#2 is if one project is REQUIRED and the same project is also REJECTED under the same employee, this is NOT considered a duplicate. For example, rows 2 and 5 are NOT duplicates.

I have a query for the first criterion:

select
    emp_id,
    proj_id,
    status,
    COUNT(*)
from
    employee_projects
group by
    emp_id,
    proj_id,
    status
having
    COUNT(*) > 1

What I'm struggling to construct is the SQL for the second criterion.

CodePudding user response:

maybe a self join can help you.

with t (employee_id ,project_id,status)
as
(
select 'emp1',  'proj1' ,   'VERIFIED'
Union all select 'emp2',    'proj2' ,   'REJECTED'
Union all select 'emp1',    'proj1' ,   'VERIFIED'
Union all select 'emp1',    'proj3' ,   'REJECTED'
Union all select 'emp2',    'proj2' ,   'REQUIRED'
Union all select 'emp3',    'proj4' ,   'SUBMITTED'
Union all select 'emp4',    'proj5' ,   'VERIFIED'
Union all select 'emp4',    'proj6' ,   'VERIFIED'
Union all select 'emp3',    'proj4' ,   'REQUIRED'
)
select
    t.employee_id,
    t.project_id,
    t.status,
    '' as status,
    'criteria#1' as SQL
from
    t
group by
    t.employee_id,
    t.project_id,
    t.status
having
    COUNT(*) > 1
union all
SELECT 
    t.employee_id,
    t.project_id,
    t.status,
    a.status,
    'criteria#2' as SQL
FROM
    t
    left join t as a on 
        t.employee_id = a.employee_id and
        t.project_id = a.project_id
where 
    t.status != a.status and
    concat(t.status,a.status) != 'REQUIREDREJECTED' and
    concat(t.status,a.status) != 'REJECTEDREQUIRED'

CodePudding user response:

Try the following:

select T.emp_id, T.proj_id, T.status, D.dup_cnt
from employee_projects T join
(
  select emp_id, proj_id, count(*) as dup_cnt
  from employee_projects
  group by emp_id, proj_id
  having count(*) > 1 and
    count(distinct case when status in ('REQUIRED', 'REJECTED') then status end) < 2
) D
on T.emp_id = D.emp_id and T.proj_id = D.proj_id
order by T.emp_id, T.proj_id

If you want to consider an employee with statuses ('REQUIRED', 'REJECTED', any other statuses) as duplicate, modify the having clause as the following:

select T.emp_id, T.proj_id, T.status, D.dup_cnt
from employee_projects T join
(
  select emp_id, proj_id, count(*) as dup_cnt
  from employee_projects
  group by emp_id, proj_id
  having count(*) > 1 and
    (count(distinct case when status in ('REQUIRED', 'REJECTED') then status end) < 2 or count(distinct status) > 2)
) D
on T.emp_id = D.emp_id and T.proj_id = D.proj_id
order by T.emp_id, T.proj_id

See a demo.

  •  Tags:  
  • sql
  • Related