Recursion query in Postgres (mutual match based on score)-CodePudding

I got records from two different sources and the goal is to create link between sources that match each other. For that purpose there is applied trained AI model which assigns score of probability of match for each record from source A to each record to source B. The table score then looks like this.

src_a_id    src_b_id    score
-----------------------------
1           foo         0.8
1           bar         0.7
1           baz         0.6
2           foo         0.9
2           bar         0.5
2           baz         0.3

Now I need to read from this table what is the most likely counter match to src_a record with id 1. When you select data with sql SELECT * FROM score WHERE src_a_id = 1 ORDER BY score DESC; you will get this result.

src_a_id    src_b_id    score
-----------------------------
1           foo         0.8
1           bar         0.7
1           baz         0.6

Here it looks like the first row is result I am looking for and so the counter match is src_b record with id foo with mutual score 0.8 but it is not correct. We can query from the other side to verify result. What is counter match to src_b with id foo? Using sql SELECT * FROM score WHERE src_b_id = 'foo' ORDER BY score DESC; we get result:

src_a_id    src_b_id    score
-----------------------------
2           foo         0.9
1           foo         0.8

From the first query it has looked like src_a id 1 matches src_b id foo. From the second query it is clear that previous conclusion is wrong because src_b id foo matches to src_a id 2 because this pair has higher mutual score.

How should I write query to find the match to src_a record with id 1 considering that table will have thousands hundred records?

My first steps were searching for some recursive queries in Postgres but tutorials I have found didn't fit to my use case and honestly I am quite failing make up any working application so far.

EDIT

For the demonstration syntax of the creating testing data:

CREATE TABLE score (
    src_a_id integer NOT NULL,
    src_b_id varchar(255) NOT NULL,
    score decimal(3,2) NOT NULL
);

INSERT INTO score (src_a_id, src_b_id, score)
VALUES 
    (1, 'foo', 0.8),
    (1, 'bar', 0.7),
    (1, 'baz', 0.6),
    (2, 'foo', 0.9),
    (2, 'bar', 0.5),    
    (2, 'baz', 0.3);

From the testing data it can be derived there exist two pairs.

1 matches bar
2 matches foo
baz doesn't have match

How can I query for src_a id 1 match? Expected result is src_b id bar. And from the other side. How can I query for src_b id bar match? Expected result is src_a id 1.

CodePudding user response：

It seems likely that your problem can be solved w/o recursion by using a window function row_number() over(<partition>). What you want is to find such pairs where the score is maximal for each id individually.

Given the sample dataset you provided - we can write this CTE where we have 2 row numbers (one for each id) and then sum them to get the rank of a pair:

with ranks as (
    select 
        src_a_id, 
        src_b_id, score,
        row_number() over (partition by src_b_id order by score desc) src_b_idx,
        row_number() over (partition by src_a_id order by score desc) 
              row_number() over (partition by src_b_id order by score desc)  pair_rank
    from score
)

With that you'd get this result:

src_a_id  src_b_id  score   pair_rank
-------------------------------------
1         bar        0.7    3
1         baz        0.6    5
1         foo        0.8    3
2         bar        0.5    5
2         baz        0.8    3
2         foo        0.9    2

And now you can pick pairs where the pair_rank is minimal

select src_a_id, src_b_id, score from (
    select src_a_id, src_b_id, score, 
        row_number() over (partition by src_a_id order by pair_rank, src_b_idx) as index
    from ranks
) data where index = 1 and <CONDITION> (e.g. src_a_id = <YOUR ID>)

Given no the query will result in all pairs where the score is the highest

src_a_id  src_b_id  score
-------------------------
1         bar       0.7
2         foo       0.9

EDIT: There are several edge cases where the above approach yields ambiguous/incorrect result:

All pairs for a given src_A_id have lower score then any other pair sharing the same src_B_id (should the query return null/0 rows/highest amongst all src_A_id?)
Multiple pairs with the same highest score sharing src_B_id (which one wins over the other given the src_A_id are different?)
Multiple different src_A_id give the highest score for the same src_B_id (again, which one wins over the other given both src_A_id and src_B_id are the same?)

Given the below dataset, you can observe all 3 cases:

src_a_id src_b_id  score
------------------------
    1    foo       0.8  |
    1    bar       0.7  | -> all pairs are beanten by some other src_a_id
    1    baz       0.6  |
    2    foo       0.9
    2    bar       0.5
    2    baz       0.8  -> higest for `baz`, but 3.baz has the same score
    3    foo       0.91 |
    3    bar       0.91 | -> both pairs are the higest, but share src_a_id
    3    baz       0.8

And here is the adapted script, but you can adjust according to the desired behavior:

b_rank as (
    select 
        src_a_id, src_b_id,
        rank() over (partition by src_b_id order by score desc) src_b_idx
    from score
)

select src_a_id, src_b_id, score from (
    select 
        rank() over (partition by s.src_a_id order by score desc) score_rank, s.* 
    from score s
    join b_rank b on s.src_a_id = b.src_a_id and s.src_b_id = b.src_b_id and src_b_idx = 1
) data where score_rank = 1 and src_a_id = XXX

Which yield:

null if all pairs are taken (example src_a_id = 1)
the highest scored pair even if the same score is shared by another src_b_id (example src_a_id = 2)
multiple rows if all these pairs have the highest score given the same src_a_id (example src_a_id = 3)