Struggle with complex SQL query: are two nodes linked in a directed graph?-CodePudding

I've been stuck on this SQL query for a day now, so I'm throwing it up here and would appreciate any advice others can give.

This is the problem: I want to generate a set of pairs of tags (named entities from articles), a and b, ordered by how many articles they co-occur in. This is relatively simple. However, there's a twist: the query should also check another table, link, to see if there's already an existing link between both tags. A link is a directed edge, ie. two tags could be connected either a->b or b->a.

As a minimum, I want to filter out all links where a and b are already connected - but a better implementation would allow me to return unfiltered pairs, with the type of the link whereever a link exists.

Here's the basic pair-generating query, which works as expected:

SELECT
   l.cluster AS left_id,
   l.cluster_type AS left_type,
   l.cluster_label AS left_label,
   r.cluster AS right_id,
   r.cluster_type AS right_type,
   r.cluster_label AS right_label,
   count(distinct(l.article)) AS articles
FROM tag AS l, tag AS r
WHERE
   l.cluster > r.cluster
   AND l.article = r.article
GROUP BY l.cluster, l.cluster_label, l.cluster_type, r.cluster, r.cluster_label, r.cluster_type
ORDER BY count(distinct(l.article)) DESC;

CTE-based approach

Here's a sort of solution to the sub-problem of getting all the pairs where a link exists:

WITH links AS (
  SELECT
    greatest(link.source_cluster, link.target_cluster) AS big,
    least(link.source_cluster, link.target_cluster) AS smol,
    link.type AS type
  FROM link AS link
)
SELECT l.cluster AS left_id, l.cluster_type AS left_type, l.cluster_label AS left_label, r.cluster AS right_id, r.cluster_type AS right_type, r.cluster_label AS right_label,
  count(distinct(l.article)) AS articles,
  array_agg(distinct(links.type)) AS link_types
FROM tag AS r, tag AS l
  JOIN links ON l.cluster = links.big
WHERE
  l.cluster > r.cluster
  AND l.article = r.article
  AND r.cluster = links.smol
GROUP BY l.cluster, l.cluster_label, l.cluster_type, r.cluster, r.cluster_label, r.cluster_type
ORDER BY count(distinct(l.article)) DESC

But this doesn't handle showing unlinked pairs, or showing both linked and unlinked pairs. Maybe there's some way of sub-querying the links CTE in the main query that would handle non-linked pairs?

Table definitions

CREATE TABLE tag (
    cluster character varying(40),
    article character varying(255),
    cluster_type character varying(10),
    cluster_label character varying,
);

CREATE TABLE link (
    source_cluster character varying(40),
    target_cluster character varying(40),
    type character varying(255),
);

Example data

tag:

"cluster","cluster_type","cluster_label","article"
"fffcc580c020f689e206fddbc32777f0d0866f23","LOC","Russia","a"
"fffcc580c020f689e206fddbc32777f0d0866f23","LOC","Russia","b"
"fff03a54c98cf079d562998d511ef2823d1f1863","PER","Vladimir Putin","a"
"fff03a54c98cf079d562998d511ef2823d1f1863","PER","Vladimir Putin","b"
"fff03a54c98cf079d562998d511ef2823d1f1863","PER","Vladimir Putin","d"
"ff9be8adf69cddee1b910e592b119478388e2194","LOC","Moscow","a"
"ff9be8adf69cddee1b910e592b119478388e2194","LOC","Moscow","b"
"ffeeb6ebcdc1fe87a3a2b84d707e17bd716dd20b","LOC","Latvia","a"
"ffd364472a999c3d1001f5910398a53997ae0afe","ORG","OCCRP","a"
"ffd364472a999c3d1001f5910398a53997ae0afe","ORG","OCCRP","d"
"fef5381215b1dfded414f5e60469ce32f3334fdd","ORG","Moldindconbank","a"
"fef5381215b1dfded414f5e60469ce32f3334fdd","ORG","Moldindconbank","c"
"fe855a808f535efa417f6d082f5e5b6581fb6835","ORG","KGB","a"
"fe855a808f535efa417f6d082f5e5b6581fb6835","ORG","KGB","b"
"fe855a808f535efa417f6d082f5e5b6581fb6835","ORG","KGB","d"
"fff14a3c6d8f6d04f4a7f224b043380bb45cb57a","ORG","Moldova","a"
"fff14a3c6d8f6d04f4a7f224b043380bb45cb57a","ORG","Moldova","c"

link

"source_cluster","target_cluster","type"
"fff03a54c98cf079d562998d511ef2823d1f1863","fffcc580c020f689e206fddbc32777f0d0866f23","LOCATED"
"fe855a808f535efa417f6d082f5e5b6581fb6835","fff03a54c98cf079d562998d511ef2823d1f1863","EMPLOYER"
"fff14a3c6d8f6d04f4a7f224b043380bb45cb57a","fef5381215b1dfded414f5e60469ce32f3334fdd","LOCATED"

CodePudding user response：

I think what you want is a LEFT OUTER JOIN from tags to links, which doesn't filter out tag pairs without a link but annotates them with the link when it exists.

SELECT
   l.cluster AS left_id,
   l.cluster_type AS left_type,
   l.cluster_label AS left_label,
   r.cluster AS right_id,
   r.cluster_type AS right_type,
   r.cluster_label AS right_label,
   count(distinct(l.article)) AS articles,
   max(link.type) as type,
   l.cluster = max(link.source_cluster) AS "->"
FROM tag AS l CROSS JOIN tag AS r
LEFT OUTER JOIN link ON 
(l.cluster = link.source_cluster AND r.cluster = link.target_cluster)
OR
(l.cluster = link.target_cluster AND r.cluster = link.source_cluster)
WHERE
   l.cluster > r.cluster
   AND l.article = r.article
GROUP BY l.cluster, l.cluster_label, l.cluster_type, r.cluster, r.cluster_label, r.cluster_type
ORDER BY count(distinct(l.article)) DESC;

The max() business is me being lazy about fixing the fact that the query doesn't know there's at most one link per tag pair so it needs an aggregate function.

CodePudding user response：

To account for the possibility of links at a depth > 2 (the current link table only has a linking of depth 2: fff03a54c98cf079d562998d511ef2823d1f1863 > fe855a808f535efa417f6d082f5e5b6581fb6835 ), you can use a recursive cte to build the source_cluster paths to each target_cluster, which can then be flattened and left joined onto your main query:

with recursive cte(a, b, js) as (
  select l.source_cluster, l.target_cluster, jsonb_build_array(l.source_cluster) from link l
  union all
  select l.source_cluster, l.target_cluster, js || jsonb_build_array(l.source_cluster) from cte c 
  join link l on l.source_cluster = c.b
)
select t1.* from (
   select t1.cluster l1, t2.cluster l2, sum(case when t1.article = t2.article then 1 end) s 
   from tag t1 cross join tag t2
   where t1.cluster < t2.cluster
   group by t1.cluster, t2.cluster) t1
left join (select v#>>'{}' a, c.b b 
    from cte c cross join jsonb_array_elements(c.js) v) t2 
on greatest(t1.l1, t1.l2) = greatest(t2.a, t2.b) and least(t1.l1, t1.l2) = least(t2.a, t2.b)
where t2.a is null
order by t1.s desc

See fiddle.