I am stuck on the best way to handle this. I am importing a file and building a data cleaning process. I have no choice but to accept this bad data. The file has some duplicate account numbers.
If the duplicate accountNum rows all share the same statusID, I need to keep the row with the earliest date; if the duplicate accountNum rows have different statusIDs, I need to keep the row with the latest date.
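For illustration, with hypothetical values (not from the actual file), the desired behavior looks like this:

accountNum  statusID  date        keep?
100         A         2023-01-01  yes  -- statusIDs all match: earliest date wins
100         A         2023-02-01  no
200         A         2023-01-01  no
200         B         2023-03-01  yes  -- statusIDs differ: latest date wins
300         A         2023-05-01  yes  -- not a duplicate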
My first thought is to separate the records into multiple temp tables and finally UNION the cleaned duplicates with the non-duplicates... but that seems inefficient.
This is where I am starting:
SELECT accountNum,
       statusID,
       [date],
       row_number() OVER (PARTITION BY accountNum
                          ORDER BY [date] DESC) AS acct_dup_row_num,
       -- partition on the two columns rather than CONCAT(accountNum, statusID),
       -- which can produce false matches (e.g. '12' + '34' = '123' + '4')
       row_number() OVER (PARTITION BY accountNum, statusID
                          ORDER BY [date] ASC) AS status_dup_row_num
INTO #tempTable1
FROM [file]
GROUP BY accountNum, statusID, [date]
HAVING COUNT(accountNum) > 1
Answer:
Use a derived table to count the distinct statusID values per accountNum, then use that count to order [date] ASC or DESC inside a single row_number() window:
select f.accountNum,
       f.statusID,
       f.[date],
       -- when the account has one distinct statusID, the earliest date
       -- sorts first; when it has several, the latest date sorts first
       r = row_number() over (partition by f.accountNum
                              order by case when s.cnt = 1 then f.[date] end asc,
                                       case when s.cnt > 1 then f.[date] end desc)
from   [file] f
       inner join
       (
           select accountNum, cnt = count(distinct statusID)
           from   [file]
           group  by accountNum
       ) s on f.accountNum = s.accountNum
Only one of the two CASE expressions returns a non-NULL value for any given account, so the effective sort direction flips per account. The rows you want are the ones where r = 1.
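Since a window function can't be referenced directly in a WHERE clause, one way to apply that filter, sketched here reusing the #tempTable1 name from the question, is to wrap the query in a derived table:

select accountNum, statusID, [date]
into   #tempTable1
from   (
           select f.accountNum,
                  f.statusID,
                  f.[date],
                  r = row_number() over (partition by f.accountNum
                                         order by case when s.cnt = 1 then f.[date] end asc,
                                                  case when s.cnt > 1 then f.[date] end desc)
           from   [file] f
                  inner join
                  (
                      select accountNum, cnt = count(distinct statusID)
                      from   [file]
                      group  by accountNum
                  ) s on f.accountNum = s.accountNum
       ) t
where  t.r = 1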