Home > Software design >  Remove duplicates only when they occur in pair per group
Remove duplicates only when they occur in pair per group

Time:02-02

In the following data:

id_1 id_2 colA colB colC
1 2 2022-01-02 scroll fast 12
1 2 2022-01-02 scroll fast 12
1 3 2022-01-03 scroll fast fast 11
1 3 2022-01-03 scroll fast fast 11
1 3 2022-01-03 scroll fast fast 11

I would like to remove duplicates only when they occur in pair. The output table should have one of the rows from the first two rows and remaining rows should be left as is.

The output table would be

id_1 id_2 colA colB colC
1 2 2022-01-02 scroll fast 12
1 3 2022-01-03 scroll fast fast 11
1 3 2022-01-03 scroll fast fast 11
1 3 2022-01-03 scroll fast fast 11

I have large dataset, so I cannot filter based on Id_1 and id_2. I am looking for a generic solution.

Test data is here at Sqlfiddle.

CodePudding user response:

We can use a combination of COUNT() and ROW_NUMBER(), both as analytic functions:

WITH cte AS (
    SELECT t.*, COUNT(*) OVER (PARTITION BY id_1, id_2) cnt,
                ROW_NUMBER() OVER (PARTITION BY id_1, id_2 ORDER BY id_1) rn
    FROM yourTable t
)

SELECT id_1, id_2, colA, colB, colC
FROM cte
WHERE cnt <> 2 OR rn = 1;

The above logic returns rows appearing only once or appearing 3 times or more. In the case of pair duplicates, it arbitrarily returns one of them.

CodePudding user response:

You can use a MERGE statement and count the size of each group using analytic functions and then delete the second row of a group when there are only 2 rows per group:

MERGE INTO table_name dst
USING (
  SELECT ROWID AS rid,
         ROW_NUMBER() OVER (PARTITION BY id_1, id_2, colA, colB, colC ORDER BY ROWNUM)
           AS rn,
         COUNT(*) OVER (PARTITION BY id_1, id_2, colA, colB, colC) AS cnt
  FROM   table_name
) src
ON (dst.ROWID = src.rid AND src.rn = 2 AND src.cnt = 2)
WHEN MATCHED THEN
  UPDATE SET colC = colC
  DELETE WHERE 1 = 1;

Which, for the sample data:

create table table_name (id_1, id_2, colA, colB, colC) AS
SELECT 1, 2, 'A', 'B', 'C' FROM DUAL UNION ALL
SELECT 1, 2, 'A', 'B', 'C' FROM DUAL UNION ALL
SELECT 1, 3, 'D', 'E', 'F' FROM DUAL UNION ALL
SELECT 1, 3, 'D', 'E', 'F' FROM DUAL UNION ALL
SELECT 1, 3, 'D', 'E', 'F' FROM DUAL;

Then after the MERGE the table contains:

ID_1 ID_2 COLA COLB COLC
1 2 A B C
1 3 D E F
1 3 D E F
1 3 D E F

fiddle

  • Related