Delete repeated occurrences of a value across two columns-CodePudding

I have a dimension that stores salary awards by code and description. The Award_Code and Award_Desc combination forms a natural key. Each code should have only one description and each description should have only one code, but over the years people have added the same award code but with a different description or the same description but with a different award code, resulting in a table like below. In this example one of the award codes is found twice with different descriptions (Award_SK 6 and Award_SK 2270) and one of the Award descriptions is found twice with different codes (Award_SK 6 and Award_SK 2209). Only Award_SK 6, is the correct Award_Code/ Award_Desc combo and I need to remove the others from the dimension.

Award_SK	Award_Code	Award_Desc
6	AWDTEA	Teachers Award
2209	TEAAWD	Teachers Award
2270	AWDTEA	Award for Teachers

To work out which descriptions and codes are associated with each other as in the above table, I have run the following code that grabs rows that join more than once on the Award_Code or the Award_Description.

--get the list of awards that are associated either by code or description, and put them in a temporary table
    SELECT * INTO #DuplicatedAwards
    FROM
    (
        SELECT Dim_Award_SK,AWARD_CODE, AWARD_DESC
        FROM
        (
            --Type 1: different Award codes, same award description
            SELECT  Dim_Award_SK, award_code,AWD.Award_Desc FROM    
            DM.DIM_AWARD AWD    
            INNER JOIN  
                (SELECT Award_Desc, COUNT(Dim_Award_SK) as total_of_Same_Description_different_code FROM DM.DIM_AWARD
                GROUP BY Award_Desc, Award_Class_Desc
                HAVING count(Award_Desc)>1 
                ) A ON AWD.Award_Desc=A.Award_Desc 
            
            UNION ALL
    
            --Type 2: different award description, same award code
            SELECT  Dim_Award_SK, A.Award_Code,AWD.Award_Desc FROM
            DM.DIM_AWARD AWD
            INNER JOIN
                (SELECT Award_Code,COUNT(Dim_Award_SK) as Total_of_Same_Code_Different_Description FROM DM.DIM_AWARD
                GROUP BY Award_Code
                HAVING count(DISTINCT Award_Desc)>1 
            ) A ON AWD.Award_Code=A.Award_Code 
        )B
    )C
    
    --Join the temporary table to the dimension on award code OR award description.  This will show an Award_SK in the first column 
    --and its matched Award_SK's in the second column
 --When a new SK starts in the first column we are looking at a new group of matched awards
    
    SELECT DISTINCT
    AW.Dim_Award_SK as Award_SK,
    DIM.Dim_Award_SK as Matching_Award_SK 
    FROM #DuplicatedAwards AW
    INNER JOIN DM.DIM_AWARD DIM
    ON DIM.Award_Code=AW.Award_Code OR DIM.Award_Desc=AW.Award_Desc
    --exclude rows where the affected SK is matched with itself
    WHERE DIM.Dim_Award_SK <> AW.Dim_Award_SK
    ORDER BY  AW.Dim_Award_SK, DIM.Dim_Award_SK
    
    DROP TABLE #DuplicatedAwards

This gives me a result like this:

Award_SK	Matched Award_SK
6	2209
6	2270
8	1853
8	2278
17	2052
17	2442
22	1895
22	2282
22	2428
1853	8
1853	2278
1895	22
1895	2282
1895	2428
2052	17
2052	2442
2209	6
2209	2270
2270	6
2270	2209
2278	8
2278	1853
2282	22
2282	1895
2282	2428
2428	22
2428	1895
2428	2282
2442	17
2442	2052

The first two values in the left column are the same, so I know I need to look at the details of Award_SK 6, 2209 and 2270 in the dimension to work out business-wise, which is the right Award_SK to keep and which other two can be discarded. Next, rows 3 and 4 both show Award_SK 8 so I know that I need to look at Award_SK 8, 1853 and 2278 together, and so on. However, these combinations will show up more than once in a different arrangement as I move through the table. Award_SK 1853 eventually shows up again in the first column with Award_SK 8 and Award_SK 2278 in the second column. I have 8000 rows in my table but if I stop the combinations recurring, the table will be significantly smaller and I'll end up with a table like this. I'm not sure what to add to my code to make that happen. Maybe I could even do it in Excel, but again, not sure how.

Award_SK	Matched Award_SK
6	2209
6	2270
8	1853
8	2278
17	2052
17	2442
22	1895
22	2282
22	2428

I'd really appreciate any help with this. Thanks.

CodePudding user response：

Instead of avoiding matching to the same, you can use that as matching to higher.

SELECT DISTINCT
  AW.Dim_Award_SK as Award_SK,
  DIM.Dim_Award_SK as Matching_Award_SK 
FROM #DuplicatedAwards AW
JOIN DM.DIM_AWARD DIM
  ON ( DIM.Award_Code = AW.Award_Code OR
       DIM.Award_Desc = AW.Award_Desc 
     ) 
 AND AW.Dim_Award_SK < DIM.Dim_Award_SK
ORDER BY AW.Dim_Award_SK, DIM.Dim_Award_SK