Home > Back-end >  Delete repeated occurrences of a value across two columns
Delete repeated occurrences of a value across two columns

Time:12-16

I have a dimension that stores salary awards by code and description. The Award_Code and Award_Desc combination forms a natural key. Each code should have only one description and each description should have only one code, but over the years people have added the same award code but with a different description or the same description but with a different award code, resulting in a table like below. In this example one of the award codes is found twice with different descriptions (Award_SK 6 and Award_SK 2270) and one of the Award descriptions is found twice with different codes (Award_SK 6 and Award_SK 2209). Only Award_SK 6, is the correct Award_Code/ Award_Desc combo and I need to remove the others from the dimension.

Award_SK Award_Code Award_Desc
6 AWDTEA Teachers Award
2209 TEAAWD Teachers Award
2270 AWDTEA Award for Teachers

To work out which descriptions and codes are associated with each other as in the above table, I have run the following code that grabs rows that join more than once on the Award_Code or the Award_Description.

--get the list of awards that are associated either by code or description, and put them in a temporary table
    SELECT * INTO #DuplicatedAwards
    FROM
    (
        SELECT Dim_Award_SK,AWARD_CODE, AWARD_DESC
        FROM
        (
            --Type 1: different Award codes, same award description
            SELECT  Dim_Award_SK, award_code,AWD.Award_Desc FROM    
            DM.DIM_AWARD AWD    
            INNER JOIN  
                (SELECT Award_Desc, COUNT(Dim_Award_SK) as total_of_Same_Description_different_code FROM DM.DIM_AWARD
                GROUP BY Award_Desc, Award_Class_Desc
                HAVING count(Award_Desc)>1 
                ) A ON AWD.Award_Desc=A.Award_Desc 
            
            UNION ALL
    
            --Type 2: different award description, same award code
            SELECT  Dim_Award_SK, A.Award_Code,AWD.Award_Desc FROM
            DM.DIM_AWARD AWD
            INNER JOIN
                (SELECT Award_Code,COUNT(Dim_Award_SK) as Total_of_Same_Code_Different_Description FROM DM.DIM_AWARD
                GROUP BY Award_Code
                HAVING count(DISTINCT Award_Desc)>1 
            ) A ON AWD.Award_Code=A.Award_Code 
        )B
    )C
    
    --Join the temporary table to the dimension on award code OR award description.  This will show an Award_SK in the first column 
    --and its matched Award_SK's in the second column
 --When a new SK starts in the first column we are looking at a new group of matched awards
    
    SELECT DISTINCT
    AW.Dim_Award_SK as Award_SK,
    DIM.Dim_Award_SK as Matching_Award_SK 
    FROM #DuplicatedAwards AW
    INNER JOIN DM.DIM_AWARD DIM
    ON DIM.Award_Code=AW.Award_Code OR DIM.Award_Desc=AW.Award_Desc
    --exclude rows where the affected SK is matched with itself
    WHERE DIM.Dim_Award_SK <> AW.Dim_Award_SK
    ORDER BY  AW.Dim_Award_SK, DIM.Dim_Award_SK
    
    DROP TABLE #DuplicatedAwards

This gives me a result like this:

Award_SK Matched Award_SK
6 2209
6 2270
8 1853
8 2278
17 2052
17 2442
22 1895
22 2282
22 2428
1853 8
1853 2278
1895 22
1895 2282
1895 2428
2052 17
2052 2442
2209 6
2209 2270
2270 6
2270 2209
2278 8
2278 1853
2282 22
2282 1895
2282 2428
2428 22
2428 1895
2428 2282
2442 17
2442 2052

The first two values in the left column are the same, so I know I need to look at the details of Award_SK 6, 2209 and 2270 in the dimension to work out business-wise, which is the right Award_SK to keep and which other two can be discarded. Next, rows 3 and 4 both show Award_SK 8 so I know that I need to look at Award_SK 8, 1853 and 2278 together, and so on. However, these combinations will show up more than once in a different arrangement as I move through the table. Award_SK 1853 eventually shows up again in the first column with Award_SK 8 and Award_SK 2278 in the second column. I have 8000 rows in my table but if I stop the combinations recurring, the table will be significantly smaller and I'll end up with a table like this. I'm not sure what to add to my code to make that happen. Maybe I could even do it in Excel, but again, not sure how.

Award_SK Matched Award_SK
6 2209
6 2270
8 1853
8 2278
17 2052
17 2442
22 1895
22 2282
22 2428

I'd really appreciate any help with this. Thanks.

CodePudding user response:

Instead of avoiding matching to the same, you can use that as matching to higher.

SELECT DISTINCT
  AW.Dim_Award_SK as Award_SK,
  DIM.Dim_Award_SK as Matching_Award_SK 
FROM #DuplicatedAwards AW
JOIN DM.DIM_AWARD DIM
  ON ( DIM.Award_Code = AW.Award_Code OR
       DIM.Award_Desc = AW.Award_Desc 
     ) 
 AND AW.Dim_Award_SK < DIM.Dim_Award_SK
ORDER BY AW.Dim_Award_SK, DIM.Dim_Award_SK
  • Related