I got records from two different sources and the goal is to create link between sources that match each other. For that purpose there is applied trained AI model which assigns score of probability of match for each record from source A to each record to source B. The table score
then looks like this.
src_a_id src_b_id score
-----------------------------
1 foo 0.8
1 bar 0.7
1 baz 0.6
2 foo 0.9
2 bar 0.5
2 baz 0.3
Now I need to read from this table what is the most likely counter match to src_a
record with id 1
. When you select data with sql SELECT * FROM score WHERE src_a_id = 1 ORDER BY score DESC;
you will get this result.
src_a_id src_b_id score
-----------------------------
1 foo 0.8
1 bar 0.7
1 baz 0.6
Here it looks like the first row is result I am looking for and so the counter match is src_b
record with id foo
with mutual score 0.8
but it is not correct. We can query from the other side to verify result. What is counter match to src_b
with id foo
? Using sql SELECT * FROM score WHERE src_b_id = 'foo' ORDER BY score DESC;
we get result:
src_a_id src_b_id score
-----------------------------
2 foo 0.9
1 foo 0.8
From the first query it has looked like src_a
id 1
matches src_b
id foo
.
From the second query it is clear that previous conclusion is wrong because src_b
id foo
matches to src_a
id 2
because this pair has higher mutual score.
How should I write query to find the match to src_a
record with id 1
considering that table will have thousands hundred records?
My first steps were searching for some recursive queries in Postgres but tutorials I have found didn't fit to my use case and honestly I am quite failing make up any working application so far.
EDIT
For the demonstration syntax of the creating testing data:
CREATE TABLE score (
src_a_id integer NOT NULL,
src_b_id varchar(255) NOT NULL,
score decimal(3,2) NOT NULL
);
INSERT INTO score (src_a_id, src_b_id, score)
VALUES
(1, 'foo', 0.8),
(1, 'bar', 0.7),
(1, 'baz', 0.6),
(2, 'foo', 0.9),
(2, 'bar', 0.5),
(2, 'baz', 0.3);
From the testing data it can be derived there exist two pairs.
1
matchesbar
2
matchesfoo
baz
doesn't have match
How can I query for src_a id 1
match? Expected result is src_b id bar
. And from the other side. How can I query for src_b id bar
match? Expected result is src_a id 1
.
CodePudding user response:
It seems likely that your problem can be solved w/o recursion by using a window function row_number() over(<partition>)
. What you want is to find such pairs where the score is maximal for each id individually.
Given the sample dataset you provided - we can write this CTE where we have 2 row numbers (one for each id) and then sum them to get the rank of a pair:
with ranks as (
select
src_a_id,
src_b_id, score,
row_number() over (partition by src_b_id order by score desc) src_b_idx,
row_number() over (partition by src_a_id order by score desc)
row_number() over (partition by src_b_id order by score desc) pair_rank
from score
)
With that you'd get this result:
src_a_id src_b_id score pair_rank
-------------------------------------
1 bar 0.7 3
1 baz 0.6 5
1 foo 0.8 3
2 bar 0.5 5
2 baz 0.8 3
2 foo 0.9 2
And now you can pick pairs where the pair_rank
is minimal
select src_a_id, src_b_id, score from (
select src_a_id, src_b_id, score,
row_number() over (partition by src_a_id order by pair_rank, src_b_idx) as index
from ranks
) data where index = 1 and <CONDITION> (e.g. src_a_id = <YOUR ID>)
Given no the query will result in all pairs where the score is the highest
src_a_id src_b_id score
-------------------------
1 bar 0.7
2 foo 0.9
EDIT: There are several edge cases where the above approach yields ambiguous/incorrect result:
- All pairs for a given
src_A_id
have lower score then any other pair sharing the samesrc_B_id
(should the query return null/0 rows/highest amongst allsrc_A_id
?) - Multiple pairs with the same highest score sharing
src_B_id
(which one wins over the other given the src_A_id are different?) - Multiple different
src_A_id
give the highest score for the samesrc_B_id
(again, which one wins over the other given both src_A_id and src_B_id are the same?)
Given the below dataset, you can observe all 3 cases:
src_a_id src_b_id score
------------------------
1 foo 0.8 |
1 bar 0.7 | -> all pairs are beanten by some other src_a_id
1 baz 0.6 |
2 foo 0.9
2 bar 0.5
2 baz 0.8 -> higest for `baz`, but 3.baz has the same score
3 foo 0.91 |
3 bar 0.91 | -> both pairs are the higest, but share src_a_id
3 baz 0.8
And here is the adapted script, but you can adjust according to the desired behavior:
b_rank as (
select
src_a_id, src_b_id,
rank() over (partition by src_b_id order by score desc) src_b_idx
from score
)
select src_a_id, src_b_id, score from (
select
rank() over (partition by s.src_a_id order by score desc) score_rank, s.*
from score s
join b_rank b on s.src_a_id = b.src_a_id and s.src_b_id = b.src_b_id and src_b_idx = 1
) data where score_rank = 1 and src_a_id = XXX
Which yield:
null
if all pairs are taken (examplesrc_a_id
= 1)- the highest scored pair even if the same score is shared by another
src_b_id
(examplesrc_a_id
= 2) - multiple rows if all these pairs have the highest score given the same
src_a_id
(examplesrc_a_id
= 3)