Home > Mobile >  Match each duplicate row to a distinct record
Match each duplicate row to a distinct record

Time:06-17

When we load new records onto our system, we first hold them in a staging loadedRecords table, where we attempt to match them to existing records on system.

Once a loaded record is matched with an existing record, we update the loaded record to reference the id of the existing record, however it's possible for there to be multiple "identical" records, ie multiple records with the same matching criteria, and I can't find a way to do this such that each existing record is matched to a different loaded record.

DB<>Fiddle

I currently have this:

UPDATE loadedRecords
SET recordID = r.id
FROM loadedRecords
  JOIN records as r
    ON r.amount = loadedRecords.amount
WHERE loadedRecords.recordID IS NULL

However this causes all 3 loaded records with an amount of 400 to be matched to the first existing record with an amount of 400:

id amount recordID
1 400 1
2 500 4
3 400 1
4 400 1

Note that ids 1, 3, and 4 were all matched with recordID 1. The result I want from this is:

id amount recordID
1 400 1
2 500 4
3 400 2
4 400 3

The exact order of the matching doesn't matter, as long as each existing record is matched to only one loaded record, and each loaded record is matched to only one existing record.

It's also possible for there to be loaded records that have no match, including the possibility of there being more "identical" loaded records than matching "identical" existing records (eg if there are 4 loaded records with an amount of 400, but only 3 existing records with that amount)
These extra loaded records should be left unmatched with a recordID of NULL

CodePudding user response:

This is actually quite easy to do if you use a simple trick which is the ROW_NUMBER() windowing function. With this function we can give each group of amt values a unique integer starting at 1 in a new column. Here is the code that does that for your tables:

 SELECT *, ROW_NUMBER() OVER (PARTITION BY amount) AS RN
 FROM loadedRecords

 SELECT *, ROW_NUMBER() OVER (PARTITION BY amount) AS RN
 FROM records

Now we just take these two and join them

UPDATE loadedRecords
SET recordID = r.id
FROM loadedRecords
JOIN (
  SELECT id, amount, ROW_NUMBER() OVER (PARTITION BY amount) AS RN
  FROM records
) r ON r.amount = loadedRecords.amount AND
       r.RN = ROW_NUMBER() OVER (PARTITION BY loadedRecords.amount)
WHERE loadedRecords.recordID IS NULL
  • Related