Home > Software design >  Trying to delete duplicate rows in SQL Server where the difference is the date or batch number
Trying to delete duplicate rows in SQL Server where the difference is the date or batch number

Time:11-14

I have this query:

SELECT 
    T1.ID_NUMBER,                                                                    
    T1.INCEPTION_DATE,
    T1.OCCURENCE,
    T1.TRANSACTION_DATE,
    T1.FILE_LOAD_DATE,
    T1.BATCH_NUM
FROM 
    mastertable T1
INNER JOIN 
    (SELECT 
         ID_NUMBER, INCEPTION_DATE, OCCURENCE, 
         COUNT(*) AS DUPL_COUNT
     FROM 
         mastertable
     WHERE 
         SOURCE_SYSTEM ='LEGACY'
     GROUP BY 
         ID_NUMBER, INCEPTION_DATE, OCCURENCE
     HAVING 
         COUNT(*) > 1) t2 ON T2.ID_NUMBER = T1.ID_NUMBER 
                          AND T2.INCEPTION_DATE = T1.INCEPTION_DATE 
                          AND T2.OCCURENCE= T1.OCCURENCE
ORDER BY 
    1, 2, 3, 4, 5

Which is returning the following results

ID_NUMBER INCEPTION_DATE OCCURENCE TRANSACTION_DATE FILE_LOAD_DATE BATCH_NUM
112897732 2008-09-15 4 2008-07-03 2008-07-07 17:57:19 06341
112897732 2008-09-15 4 2008-07-13 2008-07-18 03:35:55 06753
828194721 2008-11-11 1 2008-09-06 2008-09-17 02:50:44 97334
828194721 2008-11-11 1 2008-09-23 2008-09-24 02:55:27 98331
456457422 2008-09-28 1 2008-12-03 2008-07-13 08:08:39 00734
456457422 2008-09-28 1 2008-12-03 2008-07-18 13:35:55 00991
999272910 2008-05-07 3 2008-05-03 2008-10-13 08:08:38 11432
999272910 2008-05-07 3 2008-05-28 2008-10-18 03:35:55 13342
875328642 2008-03-01 3 2008-04-28 2008-01-23 08:08:38 74542
875328642 2008-03-01 3 2008-04-30 2008-01-25 12:55:11 77536
011028734 2008-07-12 2 2008-12-03 2008-08-07 11:57:03 23422
011028734 2008-07-12 2 2008-12-03 2008-08-11 17:23:29 25748
018264981 2008-07-09 0 2008-12-03 2008-12-07 02:18:12 00432
018264981 2008-07-09 0 2008-12-03 2008-12-11 17:44:19 00773

The earlier FILE_LOAD_DATE of each ID_NUMBER or lesser BATCH_NUM is the record I want to keep.

Is there a way to write a query that will delete the other records, perhaps using a CTE with ROW_NUMBER()?

I am hoping for something that is DRY in case this issue happens again. Thanks!

(Also if it isn't too much trouble please explain how the solution works)

CodePudding user response:

You could use a deletable CTE here:

WITH cte AS (
    SELECT *, ROW_NUMBER() OVER (PARTITION BY ID_NUMBER, INCEPTION_DATE, OCCURENCE
                                 ORDER BY FILE_LOAD_DATE, BATCH_NUM) rn
    FROM mastertable
    WHERE SOURCE_SYSTEM = 'LEGACY'
)

DELETE
FROM cte
WHERE rn > 1;

The logic here is to assign a row number to each group of records having the same values for ID_NUMBER, INCEPTION_DATE, and OCCURENCE. The first row number value of 1 will be assigned to the record having the earliest FILE_LOAD_DATE. In cases of two or more records tied for the earliest FILE_LOAD_DATE, the tie breaker will be determined by the earliest BATCH_NUM.

The delete statement removes all records except for this earliest record.

  • Related