SQL Recursion to eliminate duplicates?-CodePudding

I have a table from which I want to select values, but only if there is exactly one entry in the column "id_new". After I have selected those ids, I would update these ids and mark them as "ready" in the source table. Those ids (id_orig) would be ignored if the select would run again. The problem is, that there could be many steps needed to find all entries, which have exactly one entry for "id_new".

sample data:

id_orig	id_new
1	2
3	4
10	5
10	20
11	20
11	40
500	40

For the sample data it would mean: 1.) select all columns where there is only one entry for id_new. 2.) redo the same sql but ignore the columns where id_orig was already selected. This would result in finding (11, 20). I could repeat this and would find (500, 40) in the third step. I also wrote a long sql to find the same rows, but I am wondering if you can achieve the same result in a simpler way, maybe recursive?

select * from mytable where id_new in (select id_new from mytable group by id_new having count(*) = 1)
union
select a.* from mytable  a
where a.id_new in (select id_new from mytable where id_orig in 
(select id_orig from mytable where id_new in 
(select id_new from mytable group by id_new having count(*) = 1)))
and a.id_orig not in (
select id_orig from mytable where id_new in (select id_new from mytable group by id_new having count(*) = 1))
union all
select a.* from mytable  a
 where a.id_new in ( select id_new from mytable where id_orig in (select id_orig from mytable where id_new in (
select a.id_new from mytable  a
 where a.id_new in (select id_new from mytable where id_orig in 
(select id_orig from mytable where id_new in 
(select id_new from mytable group by id_new having count(*) = 1)))
and a.id_orig not in (
select id_orig from mytable where id_new in (select id_new from mytable group by id_new having count(*) = 1))
)))
and a.id_orig not in (
select id_orig from mytable where id_new in (select id_new from mytable group by id_new having count(*) = 1)
union all
select a.id_orig from mytable  a
 where a.id_new in (select id_new from mytable where id_orig in 
(select id_orig from mytable where id_new in 
(select id_new from mytable group by id_new having count(*) = 1)))
and a.id_orig not in (
select id_orig from mytable where id_new in (select id_new from mytable group by id_new having count(*) = 1))
)

CodePudding user response：

Maybe a window function would help eg ROW_NUMBER() (see documentation).

  select id_orig, id_new
  , row_number() over ( partition by id_orig order by id_new ) rn_
  from t ;

  -- result
  id_orig   id_new  rn_
   1        2       1
   3        4       1
   10       5       1
   10       20      2   -- 2 rows for this id!
   11       20      1
   11       40      2   -- 2 rows for this id!
   500      40      1

Once you have this result, you can eliminate all the extraneous rows by using the initial query as a subquery, and adding a WHERE clause:

select *
from (
  select id_orig, id_new
  , row_number() over ( partition by id_orig order by id_new ) rn_
  from t
) sq
where rn_ = 1 ; 

--result
id_orig id_new  rn_
1       2       1
3       4       1
10      5       1
11      20      1
500     40      1

(If you don't want the rn_ column in the output, just write SELECT id_orig, id_ new ... in the outer query.)

DBfiddle here.

CodePudding user response：

You can UNION, all single number with all Dup0licates, that have the higher id_orig.

  select MIN(t.id_orig) as id_orig, t.id_new
  from t 
  GROUP BY t.id_new
  HAVING COUNT(*) = 1
  UNION
  select t.id_orig, t.id_new
  from t INNER JOIN t as t1 ON t.id_new = t1.id_new AND t.id_orig > t1.id_orig;

id_orig | id_new
------: | -----:
      1 |      2
      3 |      4
     10 |      5
     11 |     20
    500 |     40

db<>fiddle here