I am trying to figure out how to solve an issue with a left join with two large tables (150k x 300k rows).
I have a key that should be unique, but sometimes isn't, in the format of 11002_13202. It is possible that this value occurs twice. There is an additional column that provides information on the uniqueness, which is provided like 'AAA'. I can add a where clause which would check for the match on the key AND the additional info (let's say 11002_13202_AAA). This will prevent wrong matches.
However, sometimes the additional field might be AAA in table B, which mismatches with the additional field in table A, and there is only one match on the key when joined.
Is it possible to optionally match on the additional field IF more than one join is done, and join anyways when there is only 1 match and the additional field is not necessarily equal between both tables?
Hoping to achieve the following result after joining:
Table A key | Table B key | Additional Field A | Additional Field B |
---|---|---|---|
11002_13202 | 11002_13202 | AAA | AAA |
11002_13202 | NULL | BBB | NULL |
92321_32322 | 92321_32322 | BBB | AAA |
2310_23123 | NULL | AAA | NULL |
2310_23123 | NULL | BBB | NULL |
2310_23123 | 2310_23123 | CCC | CCC |
23821_2322 | NULL | BBB | NULL |
23821_2322 | 23821_2322 | BBB | BBB |
Sample data table A
Table A key | Additional Field A |
---|---|
11002_13202 | AAA |
11002_13202 | BBB |
92321_32322 | BBB |
2310_23123 | AAA |
2310_23123 | BBB |
2310_23123 | CCC |
23821_2322 | AAA |
23821_2322 | BBB |
Sample data table B
Table B key | Additional Field B |
---|---|
11002_13202 | AAA |
92321_32322 | AAA |
2310_23123 | CCC |
23821_2322 | BBB |
CodePudding user response:
If my understanding is correct, you could use ROW_NUMBER
to only return the value of table B's columns if it's the first row, and order it by if the values match, and then an arbitrary value, I use the "addition field" (they are columns not "fields") here:
SELECT *
INTO dbo.TableA
FROM (VALUES('11002_13202','AAA'),
('11002_13202','BBB'),
('92321_32322','BBB'))V([Key],Additional);
SELECT *
INTO dbo.TableB
FROM (VALUES('11002_13202','AAA'),
('92321_32322','AAA'))V([Key],Additional);
GO
SELECT A.[Key] AS AKey,
CASE ROW_NUMBER() OVER (PARTITION BY A.[Key] ORDER BY CASE A.[Key] WHEN B.[Key] THEN 0 ELSE 1 END, A.Additional) WHEN 1 THEN B.[Key] END AS BKey,
A.Additional AS AAdditional,
CASE ROW_NUMBER() OVER (PARTITION BY A.[Key] ORDER BY CASE A.[Key] WHEN B.[Key] THEN 0 ELSE 1 END, A.Additional) WHEN 1 THEN B.Additional END AS BAdditional
FROM dbo.TableA A
JOIN dbo.TableB B ON A.[Key] = B.[Key]; --Assumed INNER JOIN, might want to be a LEFT JOIN though?
GO
DROP TABLE dbo.TableA;
DROP TABLE dbo.TableB;
If you prefer, you could put the ROW_NUMBER
expression into a CTE/derived table, so you don't need to repeat the logic:
WITH CTE AS(
SELECT A.[Key] AS AKey,
B.[Key] AS BKey,
A.Additional AS AAdditional,
B.Additional AS BAdditional,
ROW_NUMBER() OVER (PARTITION BY A.[Key] ORDER BY CASE A.[Key] WHEN B.[Key] THEN 0 ELSE 1 END, A.Additional) AS RN
FROM dbo.TableA A
JOIN dbo.TableB B ON A.[Key] = B.[Key]) --Assumed INNER JOIN, might want to be a LEFT JOIN though?
SELECT AKey,
CASE RN WHEN 1 THEN BKey END AS BKey,
AAdditional,
CASE RN WHEN 1 THEN BAdditional END AS BAdditional
FROM CTE;
As mentioned in the comments though, the data is the fundamental problem here, and time should (must) be taken to address that. Getting a "working solution" now as an interim is "fine", but having the repeat odd logic like this only make things harder over time, not easier. You'll likely end up with needing to use VIEW
s and (inline table type) FUNCTION
s just to get datasets in a consumable format for your basic queries, which could end up hindering performance if the data is really bad.
Taking time to schedule fixes for your data in the new year should be something on the priority list.