How can I match a row in SQL Server only once?

I have a problem joining two tables in SQL Server 2016 (v13) and would appreciate your help.

I have 2 tables, Revenues and Cashins.

Revenues:

RevenueID	ProductID	InvoiceNo	Amount
123	456	987	1000
234	456	987	1000

Cashins:

CashinID	ProductID	InvoiceNo	Amount
ABC	456	987	1000
CDE	456	987	1000

The goal is to match cashins automatically to revenues (but only once!).

Both tables have their own unique IDs, but the columns used to join them are:

ProductID
InvoiceNo
Amount

For entries with only one row in each table with those criteria, everything works fine.

Sometimes, though, several rows have the same values in these columns (as above) but different unique IDs. This is not an error; it is how the data is supposed to be.

The problem is that joining on these columns then produces a Cartesian product.

Here are the statements to recreate the tables:

DROP TABLE IF EXISTS Revenues
GO

CREATE TABLE Revenues 
(
    RevenueID [nvarchar](10) NULL,  
    ProductID [nvarchar](10) NULL,  
    InvoiceNo [nvarchar](10) NULL,      
    Amount money NULL
)
GO

DROP TABLE IF EXISTS CashIns
GO

CREATE TABLE CashIns 
(
    CashinID [nvarchar](10) NULL,
    ProductID [nvarchar](10) NULL,  
    InvoiceNo [nvarchar](10) NULL,      
    Amount money NULL
)
GO

INSERT INTO [Revenues] VALUES ('123', '456', '987', 1000)
INSERT INTO [Revenues] VALUES ('234', '456', '987', 1000)

INSERT INTO [CashIns] VALUES ('ABC', '456', '987', 1000)
INSERT INTO [CashIns] VALUES ('CDE', '456', '987', 1000)

Desired output:

RevenueID	ProductID	InvoiceNo	Amount	CashinID
123	456	987	1000	ABC
234	456	987	1000	CDE

SELECT 
    R.RevenueID,
    R.ProductID,
    R.InvoiceNo,
    R.Amount,
    C.CashinID
FROM 
    [Revenues] R
LEFT JOIN 
    [CashIns] C ON R.ProductID = C.ProductID
                AND R.InvoiceNo = C.InvoiceNo
                AND R.Amount = C.Amount

Results:

RevenueID	ProductID	InvoiceNo	Amount	CashinID
123	456	987	1000	ABC
123	456	987	1000	CDE
234	456	987	1000	ABC
234	456	987	1000	CDE

This makes sense in theory, but I can't find a solution where each row is used only once.

Two things I found and tried are window functions and OUTER APPLY with a TOP(1) selection. Both gave the same result:

SELECT
    *
FROM 
    [Revenues] R
OUTER APPLY 
    (SELECT TOP(1) *
     FROM [CashIns] C) C

This returns the desired columns from the Revenues table, but matches only the first row from the CashIns table:

RevenueID	ProductID	InvoiceNo	Amount	CashinID
123	456	987	1000	ABC
234	456	987	1000	ABC

I also thought about updating the Revenues table so that the matched CashinID is stored next to each row, and then checking each time that the CashinID has not already been used in that table, but I couldn't make it work.

Many thanks in advance for any help or hint in the right direction!

CodePudding user response：

As I said in my comment, you have a fundamental problem with your data relationships. One of your tables needs to reference the unique identifier of the other. Without that, all you can do is order the transactions in both tables and join them by row number. You're relying on a hope and a prayer to join your data instead of irrefutable identifiers.
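
A minimal sketch of the first option (one table referencing the other's unique identifier). It assumes a hypothetical RevenueID column on CashIns, which is not part of the original schema and would have to be filled in when each cash-in is recorded:

```sql
-- Assumption: CashIns.RevenueID is a hypothetical column that records
-- which revenue row a cash-in settles. It does not exist in the original tables.
IF COL_LENGTH('CashIns', 'RevenueID') IS NULL
    ALTER TABLE CashIns ADD RevenueID nvarchar(10) NULL;
GO

-- With an explicit reference, the join no longer depends on
-- ProductID / InvoiceNo / Amount being unique.
SELECT
    R.RevenueID,
    R.ProductID,
    R.InvoiceNo,
    R.Amount,
    C.CashinID
FROM Revenues AS R
    LEFT JOIN CashIns AS C
        ON C.RevenueID = R.RevenueID;
```

Each revenue row appears once as long as at most one cash-in points at it; revenue rows without a recorded cash-in come back with a NULL CashinID.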

--This example orders the transactions in each transaction table and uses
--the order number to join them.
WITH RevPrelim AS (
    SELECT *
        , ROW_NUMBER() OVER(PARTITION BY InvoiceNo, ProductID, Amount ORDER BY RevenueID) as row_num
    FROM [Revenues] as R
), CashinsPrelim AS (
    SELECT *
        , ROW_NUMBER() OVER(PARTITION BY InvoiceNo, ProductID, Amount ORDER BY CashinID) as row_num
    FROM [CashIns] as C
)
SELECT *
FROM RevPrelim as r
    LEFT OUTER JOIN CashinsPrelim as c
        ON c.ProductID = r.ProductID
            AND c.InvoiceNo = r.InvoiceNo
            AND c.Amount = r.Amount
            AND c.row_num = r.row_num
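
The question also mentions storing the match so that a row cannot be used twice. The following is a sketch of how the row-number pairing could be persisted, not a definitive implementation. It assumes a hypothetical RevenueID column on CashIns (not in the original schema) and that RevenueID and CashinID are unique, as the question states:

```sql
-- Assumption: CashIns.RevenueID is a hypothetical column that stores the match.
IF COL_LENGTH('CashIns', 'RevenueID') IS NULL
    ALTER TABLE CashIns ADD RevenueID nvarchar(10) NULL;
GO

WITH RevPrelim AS (
    -- Number only the revenue rows that no cash-in points at yet.
    SELECT R.RevenueID, R.ProductID, R.InvoiceNo, R.Amount
        , ROW_NUMBER() OVER(PARTITION BY R.InvoiceNo, R.ProductID, R.Amount
                            ORDER BY R.RevenueID) AS row_num
    FROM Revenues AS R
    WHERE NOT EXISTS (SELECT 1 FROM CashIns AS taken
                      WHERE taken.RevenueID = R.RevenueID)
), CashinsPrelim AS (
    -- Number only the cash-in rows that are still unassigned.
    SELECT C.CashinID, C.ProductID, C.InvoiceNo, C.Amount
        , ROW_NUMBER() OVER(PARTITION BY C.InvoiceNo, C.ProductID, C.Amount
                            ORDER BY C.CashinID) AS row_num
    FROM CashIns AS C
    WHERE C.RevenueID IS NULL
)
-- Pair the n-th open revenue row with the n-th open cash-in row of the same key.
UPDATE ci
SET RevenueID = r.RevenueID
FROM CashIns AS ci
    INNER JOIN CashinsPrelim AS c
        ON c.CashinID = ci.CashinID
    INNER JOIN RevPrelim AS r
        ON r.ProductID = c.ProductID
            AND r.InvoiceNo = c.InvoiceNo
            AND r.Amount = c.Amount
            AND r.row_num = c.row_num;
```

Because both CTEs skip rows that are already matched, the statement can be run again after new rows arrive without reassigning earlier pairs. The pairing within a group of identical keys is still arbitrary (it follows the ID order), so it is only as meaningful as that ordering.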