I have two DB2 tables, expenses
and environments
, which I need to join.
expenses
ID Expense Job
1 10 AAAAAA
2 5 BBBBBB
3 3 AAAAAAC
4 7 AAAAAA01
environments
Job Environment
AAAAAA01 PROD
BBBBBB INT
AAAAAAC PROD
I now want to join the environments
table to the expenses
table so that I know in which environments the expenses occurred. The problem with the tables is that the Job
column for both tables is slightly different. Hence, I can't just join ON ENVIRONMENTS.JOB = EXPENSES.JOB
The values are between 6 and 8 characters long, and the values in the expenses
table are often shorter than in the environments
table.
The logic for comparing the values of the Job
columns would be something like this:
- Check for an exact match of the
Job
if it is 8 characters long. - If no match is found, remove one character and try again.
- Repeat until the
Job
is six characters long.
In each step, if multiple values are found, select the first match.
I tried the following query
SELECT E.expense, ENV.environment
FROM EXPENSES E
LEFT OUTER JOIN ENVIRONMENTS ENV
ON LEFT(ENV.JOB, 6) = LEFT(E.JOB, 6)
GROUP BY E.expense, ENV.environment
The problem is that I receive duplicate values in the result. Hence, I get more expenses when grouping them by environment than I originally had.
Actual output:
ID Expense Job Job Environment
1 10 AAAAAA AAAAAA01 PROD
1 10 AAAAAA AAAAAAC PROD
2 5 BBBBBB BBBBBB INT
3 3 AAAAAAC AAAAAA01 PROD
3 3 AAAAAAC AAAAAAC PROD
4 7 AAAAAA01 AAAAAA01 PROD
4 7 AAAAAA01 AAAAAAC PROD
Desired output:
ID Expense Job Job Environment
1 10 AAAAAA AAAAAA01 PROD
2 5 BBBBBB BBBBBB INT
3 3 AAAAAAC AAAAAA01 PROD
4 7 AAAAAA01 AAAAAA01 PROD
The value in the second Job
column could be either AAAAAA01
or AAAAAAC
. This is completely irrelevant.
This is due to the `ON` condition. Let's consider the following example. The row with `ID=1` from the `expenses` table matches two rows in the `environments` table. Hence, the row with `ID=1` occurs twice in the result. Importantly, the matching rows from the `environments` table all have the same environment. Consequently, it is negligible which row I select.
How can I avoid duplicate values in the given scenario?
I have already tried the following solutions:
- How to Join to first row
- Remove Duplicates from LEFT OUTER JOIN
- How can a LEFT OUTER JOIN return more records than exist in the left table?
CodePudding user response:
Try this:
WITH
expenses (ID, Expense, Job) AS
(
VALUES
(1, 10, 'AAAAAA')
, (2, 5, 'BBBBBB')
, (3, 3, 'AAAAAAC')
, (4, 7, 'AAAAAA01')
, (5, 0, 'XXX')
)
, environments (Job, Environment) AS
(
VALUES
('AAAAAA01', 'PROD')
, ('BBBBBB' , 'INT')
, ('AAAAAAC' , 'PROD')
)
-- The 1-st solution
SELECT
ID, Expense, Job_E
, Job_ENV, Environment
FROM
(
SELECT
E.ID, E.Expense, E.Job AS Job_E
, ENV.Job AS Job_ENV, ENV.Environment
, ROW_NUMBER () OVER (PARTITION BY E.Job ORDER BY LENGTH (ENV.Job)) AS RN_
FROM expenses E
LEFT JOIN environments ENV ON ENV.JOB LIKE E.JOB || '%'
) T
WHERE RN_ = 1
ORDER BY ID
/*
-- Alternate solution
SELECT
E.ID, E.Expense, E.Job AS Job_E
, ENV.Job AS Job_ENV, ENV.Environment
FROM expenses E
LEFT JOIN TABLE
(
SELECT ENV.Job, ENV.Environment
FROM environments ENV
WHERE ENV.JOB LIKE E.JOB || '%'
ORDER BY LENGTH (ENV.Job)
FETCH FIRST 1 ROW ONLY
) ENV ON 1 = 1
ORDER BY E.ID
*/
ID | EXPENSE | JOB_E | JOB_ENV | ENVIRONMENT |
---|---|---|---|---|
1 | 10 | AAAAAA | AAAAAAC | PROD |
2 | 5 | BBBBBB | BBBBBB | INT |
3 | 3 | AAAAAAC | AAAAAAC | PROD |
4 | 7 | AAAAAA01 | AAAAAA01 | PROD |
5 | 0 | XXX |