I have a dimension that stores salary awards by code and description. The Award_Code and Award_Desc combination forms a natural key. Each code should have only one description and each description should have only one code, but over the years people have added the same award code but with a different description or the same description but with a different award code, resulting in a table like below. In this example one of the award codes is found twice with different descriptions (Award_SK 6 and Award_SK 2270) and one of the Award descriptions is found twice with different codes (Award_SK 6 and Award_SK 2209). Only Award_SK 6, is the correct Award_Code/ Award_Desc combo and I need to remove the others from the dimension.
Award_SK | Award_Code | Award_Desc |
---|---|---|
6 | AWDTEA | Teachers Award |
2209 | TEAAWD | Teachers Award |
2270 | AWDTEA | Award for Teachers |
To work out which descriptions and codes are associated with each other as in the above table, I have run the following code that grabs rows that join more than once on the Award_Code or the Award_Description.
--get the list of awards that are associated either by code or description, and put them in a temporary table
SELECT * INTO #DuplicatedAwards
FROM
(
SELECT Dim_Award_SK,AWARD_CODE, AWARD_DESC
FROM
(
--Type 1: different Award codes, same award description
SELECT Dim_Award_SK, award_code,AWD.Award_Desc FROM
DM.DIM_AWARD AWD
INNER JOIN
(SELECT Award_Desc, COUNT(Dim_Award_SK) as total_of_Same_Description_different_code FROM DM.DIM_AWARD
GROUP BY Award_Desc, Award_Class_Desc
HAVING count(Award_Desc)>1
) A ON AWD.Award_Desc=A.Award_Desc
UNION ALL
--Type 2: different award description, same award code
SELECT Dim_Award_SK, A.Award_Code,AWD.Award_Desc FROM
DM.DIM_AWARD AWD
INNER JOIN
(SELECT Award_Code,COUNT(Dim_Award_SK) as Total_of_Same_Code_Different_Description FROM DM.DIM_AWARD
GROUP BY Award_Code
HAVING count(DISTINCT Award_Desc)>1
) A ON AWD.Award_Code=A.Award_Code
)B
)C
--Join the temporary table to the dimension on award code OR award description. This will show an Award_SK in the first column
--and its matched Award_SK's in the second column
--When a new SK starts in the first column we are looking at a new group of matched awards
SELECT DISTINCT
AW.Dim_Award_SK as Award_SK,
DIM.Dim_Award_SK as Matching_Award_SK
FROM #DuplicatedAwards AW
INNER JOIN DM.DIM_AWARD DIM
ON DIM.Award_Code=AW.Award_Code OR DIM.Award_Desc=AW.Award_Desc
--exclude rows where the affected SK is matched with itself
WHERE DIM.Dim_Award_SK <> AW.Dim_Award_SK
ORDER BY AW.Dim_Award_SK, DIM.Dim_Award_SK
DROP TABLE #DuplicatedAwards
This gives me a result like this:
Award_SK | Matched Award_SK |
---|---|
6 | 2209 |
6 | 2270 |
8 | 1853 |
8 | 2278 |
17 | 2052 |
17 | 2442 |
22 | 1895 |
22 | 2282 |
22 | 2428 |
1853 | 8 |
1853 | 2278 |
1895 | 22 |
1895 | 2282 |
1895 | 2428 |
2052 | 17 |
2052 | 2442 |
2209 | 6 |
2209 | 2270 |
2270 | 6 |
2270 | 2209 |
2278 | 8 |
2278 | 1853 |
2282 | 22 |
2282 | 1895 |
2282 | 2428 |
2428 | 22 |
2428 | 1895 |
2428 | 2282 |
2442 | 17 |
2442 | 2052 |
The first two values in the left column are the same, so I know I need to look at the details of Award_SK 6, 2209 and 2270 in the dimension to work out business-wise, which is the right Award_SK to keep and which other two can be discarded. Next, rows 3 and 4 both show Award_SK 8 so I know that I need to look at Award_SK 8, 1853 and 2278 together, and so on. However, these combinations will show up more than once in a different arrangement as I move through the table. Award_SK 1853 eventually shows up again in the first column with Award_SK 8 and Award_SK 2278 in the second column. I have 8000 rows in my table but if I stop the combinations recurring, the table will be significantly smaller and I'll end up with a table like this. I'm not sure what to add to my code to make that happen. Maybe I could even do it in Excel, but again, not sure how.
Award_SK | Matched Award_SK |
---|---|
6 | 2209 |
6 | 2270 |
8 | 1853 |
8 | 2278 |
17 | 2052 |
17 | 2442 |
22 | 1895 |
22 | 2282 |
22 | 2428 |
I'd really appreciate any help with this. Thanks.
CodePudding user response:
Instead of avoiding matching to the same, you can use that as matching to higher.
SELECT DISTINCT
AW.Dim_Award_SK as Award_SK,
DIM.Dim_Award_SK as Matching_Award_SK
FROM #DuplicatedAwards AW
JOIN DM.DIM_AWARD DIM
ON ( DIM.Award_Code = AW.Award_Code OR
DIM.Award_Desc = AW.Award_Desc
)
AND AW.Dim_Award_SK < DIM.Dim_Award_SK
ORDER BY AW.Dim_Award_SK, DIM.Dim_Award_SK