I am stuck on the best way to handle this. I am importing a file and building a data cleaning process. I have no choice but to accept this bad data. The file has some duplicate account numbers.
If the duplicate accountNum rows all share the same statusID, I need to keep the row with the earliest date; if the duplicate accountNum rows have different statusIDs, I need to keep the row with the latest date.
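For illustration, with hypothetical values (not from the actual file), the desired behavior looks like this:

accountNum  statusID  date        keep?
100         A         2023-01-01  yes  -- statusIDs all match: earliest date wins
100         A         2023-02-01  no
200         A         2023-01-01  no
200         B         2023-03-01  yes  -- statusIDs differ: latest date wins
300         A         2023-05-01  yes  -- not a duplicate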
My first thought is to separate the records into multiple temp tables and finally UNION the cleaned duplicates with the non-duplicates... but that seems inefficient.
This is where I am starting:
SELECT accountNum,
       statusID,
       [date],
       row_number() OVER (PARTITION BY accountNum
                          ORDER BY [date] DESC) AS acct_dup_row_num,
       -- partition on the two columns rather than CONCAT(accountNum, statusID),
       -- which can produce false matches (e.g. '12' + '34' = '123' + '4')
       row_number() OVER (PARTITION BY accountNum, statusID
                          ORDER BY [date] ASC) AS status_dup_row_num
INTO #tempTable1
FROM [file]
GROUP BY accountNum, statusID, [date]
HAVING COUNT(accountNum) > 1
Answer:
Use a derived table to count the distinct statusID values per accountNum, then use that count to order [date] ASC or DESC inside a single row_number() window:
select f.accountNum,
       f.statusID,
       f.[date],
       -- when the account has one distinct statusID, the earliest date
       -- sorts first; when it has several, the latest date sorts first
       r = row_number() over (partition by f.accountNum
                              order by case when s.cnt = 1 then f.[date] end asc,
                                       case when s.cnt > 1 then f.[date] end desc)
from   [file] f
       inner join
       (
           select accountNum, cnt = count(distinct statusID)
           from   [file]
           group  by accountNum
       ) s on f.accountNum = s.accountNum
Only one of the two CASE expressions returns a non-NULL value for any given account, so the effective sort direction flips per account. The rows you want are the ones where r = 1.
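Since a window function can't be referenced directly in a WHERE clause, one way to apply that filter, sketched here reusing the #tempTable1 name from the question, is to wrap the query in a derived table:

select accountNum, statusID, [date]
into   #tempTable1
from   (
           select f.accountNum,
                  f.statusID,
                  f.[date],
                  r = row_number() over (partition by f.accountNum
                                         order by case when s.cnt = 1 then f.[date] end asc,
                                                  case when s.cnt > 1 then f.[date] end desc)
           from   [file] f
                  inner join
                  (
                      select accountNum, cnt = count(distinct statusID)
                      from   [file]
                      group  by accountNum
                  ) s on f.accountNum = s.accountNum
       ) t
where  t.r = 1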