DB2 | Left outer join | Avoid duplicates-CodePudding

I have two DB2 tables, expenses and environments, which I need to join.

expenses

ID    Expense  Job
1     10       AAAAAA
2     5        BBBBBB 
3     3        AAAAAAC
4     7        AAAAAA01

environments

Job        Environment
AAAAAA01   PROD
BBBBBB     INT
AAAAAAC    PROD

I now want to join the environments table to the expenses table so that I know in which environments the expenses occurred. The problem with the tables is that the Job column for both tables is slightly different. Hence, I can't just join ON ENVIRONMENTS.JOB = EXPENSES.JOB The values are between 6 and 8 characters long, and the values in the expenses table are often shorter than in the environments table.
The logic for comparing the values of the Job columns would be something like this:

Check for an exact match of the Job if it is 8 characters long.
If no match is found, remove one character and try again.
Repeat until the Job is six characters long.

In each step, if multiple values are found, select the first match.

I tried the following query

SELECT E.expense, ENV.environment
FROM EXPENSES E
LEFT OUTER JOIN ENVIRONMENTS ENV
ON LEFT(ENV.JOB, 6) = LEFT(E.JOB, 6)
GROUP BY E.expense, ENV.environment

The problem is that I receive duplicate values in the result. Hence, I get more expenses when grouping them by environment than I originally had.

Actual output:

ID    Expense  Job        Job         Environment
1     10       AAAAAA     AAAAAA01    PROD
1     10       AAAAAA     AAAAAAC     PROD
2     5        BBBBBB     BBBBBB      INT
3     3        AAAAAAC    AAAAAA01    PROD
3     3        AAAAAAC    AAAAAAC     PROD
4     7        AAAAAA01   AAAAAA01    PROD
4     7        AAAAAA01   AAAAAAC     PROD

Desired output:

ID    Expense  Job        Job         Environment
1     10       AAAAAA     AAAAAA01    PROD
2     5        BBBBBB     BBBBBB      INT
3     3        AAAAAAC    AAAAAA01    PROD
4     7        AAAAAA01   AAAAAA01    PROD

The value in the second Job column could be either AAAAAA01 or AAAAAAC. This is completely irrelevant.

This is due to the `ON` condition. Let's consider the following example. The row with `ID=1` from the `expenses` table matches two rows in the `environments` table. Hence, the row with `ID=1` occurs twice in the result. Importantly, the matching rows from the `environments` table all have the same environment. Consequently, it is negligible which row I select.

How can I avoid duplicate values in the given scenario?

I have already tried the following solutions:

CodePudding user response：

Try this:

WITH 
  expenses (ID, Expense, Job) AS
  (
    VALUES
      (1, 10, 'AAAAAA')
    , (2,  5, 'BBBBBB')
    , (3,  3, 'AAAAAAC')
    , (4,  7, 'AAAAAA01')

    , (5,  0, 'XXX')
  )
, environments (Job, Environment) AS 
  (
    VALUES
      ('AAAAAA01', 'PROD')
    , ('BBBBBB'  , 'INT')
    , ('AAAAAAC' , 'PROD')
  )

-- The 1-st solution
SELECT
  ID, Expense, Job_E
, Job_ENV, Environment
FROM
(
  SELECT 
    E.ID, E.Expense, E.Job AS Job_E
  , ENV.Job AS Job_ENV, ENV.Environment
  , ROW_NUMBER () OVER (PARTITION BY E.Job ORDER BY LENGTH (ENV.Job)) AS RN_
  FROM expenses E
  LEFT JOIN environments ENV ON ENV.JOB LIKE E.JOB || '%'
) T
WHERE RN_ = 1
ORDER BY ID

/*
  -- Alternate solution
  SELECT
  E.ID, E.Expense, E.Job AS Job_E
, ENV.Job AS Job_ENV, ENV.Environment
FROM expenses E
LEFT JOIN TABLE
(
  SELECT ENV.Job, ENV.Environment
  FROM environments ENV 
  WHERE ENV.JOB LIKE E.JOB || '%'
  ORDER BY LENGTH (ENV.Job)
  FETCH FIRST 1 ROW ONLY
) ENV ON 1 = 1
ORDER BY E.ID
*/

ID	EXPENSE	JOB_E	JOB_ENV	ENVIRONMENT
1	10	AAAAAA	AAAAAAC	PROD
2	5	BBBBBB	BBBBBB	INT
3	3	AAAAAAC	AAAAAAC	PROD
4	7	AAAAAA01	AAAAAA01	PROD
5	0	XXX