Home > OS >  Excluding "some" duplicate rows in MS SQL Server 2008
Excluding "some" duplicate rows in MS SQL Server 2008

Time:11-12

-- Any suggestions for getting this to work in MS SQL Server 2008?

-- My definition of a 'duplicate row' is "the first 3 fields match" -- I need to exclude the duplicates from my results, but I decide which 1 row to keep -- Keep the rows that have a value in the 4th column (there will only be 1 row for each set of duplicate pairs)

-- The results will keep row 2 (but not row 1) -- Also keep row 3 (but not row 4) -- Also keep all the remaining rows (none are duplicates) -- 8 of the 10 rows should be returned

IF(OBJECT_ID('tempdb..#tmp') IS NOT NULL)       DROP TABLE #tmp

CREATE TABLE #tmp
(
    aKey Int            IDENTITY(1,1)   PRIMARY KEY,    
    
    f1 VarChar(10)          NOT NULL    DEFAULT 0,  -- 1
    f2 VarChar(10)          NOT NULL    DEFAULT 0,  -- 2
    f3 VarChar(10)          NOT NULL    DEFAULT 0,  -- 3 
    f4 VarChar(10)          NOT NULL    DEFAULT 0,
    f5 VarChar(10)          NOT NULL    DEFAULT 0,
    f6 VarChar(10)          NOT NULL    DEFAULT 0
)

INSERT INTO #tmp(f1, f2, f3, f4, f5, f6)    VALUES('A', 'B', 'C', '' , 'del', '1') -- 1st of the duplicate
INSERT INTO #tmp(f1, f2, f3, f4, f5, f6)    VALUES('A', 'B', 'C', 'D', 'keep', '1') -- 2nd of the duplicate

INSERT INTO #tmp(f1, f2, f3, f4, f5, f6)    VALUES('D', 'E', 'F', 'G', 'keep', '2') -- 1st of the duplicate
INSERT INTO #tmp(f1, f2, f3, f4, f5, f6)    VALUES('D', 'E', 'F', '' , 'del', '2') -- 2nd of the duplicate

INSERT INTO #tmp(f1, f2, f3, f4, f5, f6)    VALUES('H', 'G', 'N', 'Q', '1', 'K')  
INSERT INTO #tmp(f1, f2, f3, f4, f5, f6)    VALUES('I', 'G', 'C', '' , '2', 'L')  
INSERT INTO #tmp(f1, f2, f3, f4, f5, f6)    VALUES('J', 'H', 'D', 'R', '3', 'P')  
INSERT INTO #tmp(f1, f2, f3, f4, f5, f6)    VALUES('K', 'G', 'C', '' , '4', 'K')  
INSERT INTO #tmp(f1, f2, f3, f4, f5, f6)    VALUES('L', 'G', 'C', 'S', '5', 'V')  
INSERT INTO #tmp(f1, f2, f3, f4, f5, f6)    VALUES('K', 'M', 'C', '' , '6', 'K')  

--SELECT * FROM #tmp

-- This "almost" works, but is excluding too many non-duplicate rows:

SELECT DISTINCT t1.* FROM #tmp AS t1
INNER JOIN #tmp AS t2
ON t1.f1 = t2.f1 AND t1.f2 = t2.f2 AND t1.f3 = t2.f3
WHERE t1.f4 <> ''

CodePudding user response:

First of all, excellent job posting sample data in a format that people can use. Bravo!!!! That makes it easy for people to help. Secondly, even though your stated explanation was not entirely clear you did provide enough details in the sample data to make it clear what you need to accomplish. xQbert was spot on that using ROW_NUMBER is a great way to solve this.

I used a cte here to provide some clarity in how this works. You could do that as a subquery just as easily if you desire.

with SortedValues as
(
    select *
        , RowNum = ROW_NUMBER() over (partition by f1, f2, f3 order by f4 desc)
    from #tmp
)

select *
from SortedValues
where RowNum = 1
order by aKey

CodePudding user response:

In keeping with your existing approach, instead of joining just use exists

select * 
from t
where not exists (
    select * from t t2 
    where t.f1=t2.f1 
        and t.f2=t2.f2 
        and t.f3=t2.f3 
        and t.f4 =''
        and t2.akey !=t.akey
    )
  • Related