Home > Back-end >  Efficient code for fetching all duplicate records on large datasets
Efficient code for fetching all duplicate records on large datasets

Time:05-06

I want to assign scores of zero to any record having duplicates and a score of 1 to all unique records. I have a set of data like this: Table 1-

No. City
1 null
2 null
3 null
4 Aachen
5 Berlin
6 Berlin
7 Berlin
8 Bochum
9 Bochum
10 Bristol
11 Liverpool
12 Liverpool

So, the expected result will be: Table 2 -

No. City Score
1 null 0
2 null 0
3 null 0
4 Aachen 1
5 Berlin 0
6 Berlin 0
7 Berlin 0
8 Bochum 0
9 Bochum 0
10 Bristol 1
11 Liverpool 0
12 Liverpool 0
select city,
case  when [City] in ( select [City] from [Table1] group by [City] having count([City]) = 1) then 1 else 0 end as [Score]
   from Table1 

This code works well on datasets smaller than 100k rows but if it deals with larger datasets it is too slow and the execution time runs out sometimes. It is important to recognize null values as duplicates as well. Can anyone please suggest a more efficient solution than this?

CodePudding user response:

Below is the solution that I have implemented. Reason for converting the NULLs to some character is because NULLs will be eliminated if we apply a count function

create table temp
(id int,
city varchar(200))

insert into temp
values
(1,null),
(2,null),
(3,null),
(4,'Aachen'),
(5,'Berlin'),
(6,'Berlin'),
(7,'Berlin'),
(8,'Bochum'),
(9,'Bochum'),
(10,'Bristol'),
(11,'Liverpool'),
(12,'Liverpool')

;with cte as (
select id, case when city is null then 'A' else city end as city from temp)

select id, case when city ='A' then NULL else city end as city, 
case when count(city) over (partition by city order by city )>1 then 0 else 1 end as score
from cte

enter image description here

CodePudding user response:

I'm not sure if this will be faster, depends on how the query builder works for Sql Server 2008. But it feels faster for me to structure the query like this :)

select [City],
coalesce(score, 0) as score
from [Table1] t
left join
(select [City], 1 as score from [Table1]
group by [City] having count([City]) = 1
) x on x.[City] = t.[City]
  • Related