I've been stuck on this SQL query for a day now, so I'm throwing it up here and would appreciate any advice others can give.
This is the problem: I want to generate a set of pairs of tags
(named entities from articles), a
and b
, ordered by how many articles they co-occur in. This is relatively simple. However, there's a twist: the query should also check another table, link
, to see if there's already an existing link between both tags. A link
is a directed edge, ie. two tags
could be connected either a->b
or b->a
.
As a minimum, I want to filter out all links where a
and b
are already connected - but a better implementation would allow me to return unfiltered pairs, with the type
of the link whereever a link exists.
Here's the basic pair-generating query, which works as expected:
SELECT
l.cluster AS left_id,
l.cluster_type AS left_type,
l.cluster_label AS left_label,
r.cluster AS right_id,
r.cluster_type AS right_type,
r.cluster_label AS right_label,
count(distinct(l.article)) AS articles
FROM tag AS l, tag AS r
WHERE
l.cluster > r.cluster
AND l.article = r.article
GROUP BY l.cluster, l.cluster_label, l.cluster_type, r.cluster, r.cluster_label, r.cluster_type
ORDER BY count(distinct(l.article)) DESC;
CTE-based approach
Here's a sort of solution to the sub-problem of getting all the pairs where a link exists:
WITH links AS (
SELECT
greatest(link.source_cluster, link.target_cluster) AS big,
least(link.source_cluster, link.target_cluster) AS smol,
link.type AS type
FROM link AS link
)
SELECT l.cluster AS left_id, l.cluster_type AS left_type, l.cluster_label AS left_label, r.cluster AS right_id, r.cluster_type AS right_type, r.cluster_label AS right_label,
count(distinct(l.article)) AS articles,
array_agg(distinct(links.type)) AS link_types
FROM tag AS r, tag AS l
JOIN links ON l.cluster = links.big
WHERE
l.cluster > r.cluster
AND l.article = r.article
AND r.cluster = links.smol
GROUP BY l.cluster, l.cluster_label, l.cluster_type, r.cluster, r.cluster_label, r.cluster_type
ORDER BY count(distinct(l.article)) DESC
But this doesn't handle showing unlinked pairs, or showing both linked and unlinked pairs. Maybe there's some way of sub-querying the links
CTE in the main query that would handle non-linked pairs?
Table definitions
CREATE TABLE tag (
cluster character varying(40),
article character varying(255),
cluster_type character varying(10),
cluster_label character varying,
);
CREATE TABLE link (
source_cluster character varying(40),
target_cluster character varying(40),
type character varying(255),
);
Example data
tag
:
"cluster","cluster_type","cluster_label","article"
"fffcc580c020f689e206fddbc32777f0d0866f23","LOC","Russia","a"
"fffcc580c020f689e206fddbc32777f0d0866f23","LOC","Russia","b"
"fff03a54c98cf079d562998d511ef2823d1f1863","PER","Vladimir Putin","a"
"fff03a54c98cf079d562998d511ef2823d1f1863","PER","Vladimir Putin","b"
"fff03a54c98cf079d562998d511ef2823d1f1863","PER","Vladimir Putin","d"
"ff9be8adf69cddee1b910e592b119478388e2194","LOC","Moscow","a"
"ff9be8adf69cddee1b910e592b119478388e2194","LOC","Moscow","b"
"ffeeb6ebcdc1fe87a3a2b84d707e17bd716dd20b","LOC","Latvia","a"
"ffd364472a999c3d1001f5910398a53997ae0afe","ORG","OCCRP","a"
"ffd364472a999c3d1001f5910398a53997ae0afe","ORG","OCCRP","d"
"fef5381215b1dfded414f5e60469ce32f3334fdd","ORG","Moldindconbank","a"
"fef5381215b1dfded414f5e60469ce32f3334fdd","ORG","Moldindconbank","c"
"fe855a808f535efa417f6d082f5e5b6581fb6835","ORG","KGB","a"
"fe855a808f535efa417f6d082f5e5b6581fb6835","ORG","KGB","b"
"fe855a808f535efa417f6d082f5e5b6581fb6835","ORG","KGB","d"
"fff14a3c6d8f6d04f4a7f224b043380bb45cb57a","ORG","Moldova","a"
"fff14a3c6d8f6d04f4a7f224b043380bb45cb57a","ORG","Moldova","c"
link
"source_cluster","target_cluster","type"
"fff03a54c98cf079d562998d511ef2823d1f1863","fffcc580c020f689e206fddbc32777f0d0866f23","LOCATED"
"fe855a808f535efa417f6d082f5e5b6581fb6835","fff03a54c98cf079d562998d511ef2823d1f1863","EMPLOYER"
"fff14a3c6d8f6d04f4a7f224b043380bb45cb57a","fef5381215b1dfded414f5e60469ce32f3334fdd","LOCATED"
CodePudding user response:
I think what you want is a LEFT OUTER JOIN
from tags to links, which doesn't filter out tag pairs without a link but annotates them with the link when it exists.
SELECT
l.cluster AS left_id,
l.cluster_type AS left_type,
l.cluster_label AS left_label,
r.cluster AS right_id,
r.cluster_type AS right_type,
r.cluster_label AS right_label,
count(distinct(l.article)) AS articles,
max(link.type) as type,
l.cluster = max(link.source_cluster) AS "->"
FROM tag AS l CROSS JOIN tag AS r
LEFT OUTER JOIN link ON
(l.cluster = link.source_cluster AND r.cluster = link.target_cluster)
OR
(l.cluster = link.target_cluster AND r.cluster = link.source_cluster)
WHERE
l.cluster > r.cluster
AND l.article = r.article
GROUP BY l.cluster, l.cluster_label, l.cluster_type, r.cluster, r.cluster_label, r.cluster_type
ORDER BY count(distinct(l.article)) DESC;
The max() business is me being lazy about fixing the fact that the query doesn't know there's at most one link per tag pair so it needs an aggregate function.
CodePudding user response:
To account for the possibility of links at a depth > 2 (the current link
table only has a linking of depth 2
: fff03a54c98cf079d562998d511ef2823d1f1863 > fe855a808f535efa417f6d082f5e5b6581fb6835
), you can use a recursive cte
to build the source_cluster
paths to each target_cluster
, which can then be flattened and left join
ed onto your main query:
with recursive cte(a, b, js) as (
select l.source_cluster, l.target_cluster, jsonb_build_array(l.source_cluster) from link l
union all
select l.source_cluster, l.target_cluster, js || jsonb_build_array(l.source_cluster) from cte c
join link l on l.source_cluster = c.b
)
select t1.* from (
select t1.cluster l1, t2.cluster l2, sum(case when t1.article = t2.article then 1 end) s
from tag t1 cross join tag t2
where t1.cluster < t2.cluster
group by t1.cluster, t2.cluster) t1
left join (select v#>>'{}' a, c.b b
from cte c cross join jsonb_array_elements(c.js) v) t2
on greatest(t1.l1, t1.l2) = greatest(t2.a, t2.b) and least(t1.l1, t1.l2) = least(t2.a, t2.b)
where t2.a is null
order by t1.s desc