Home > Enterprise >  SQL efficiently select distinct rows from one table depending on two columns from another table
SQL efficiently select distinct rows from one table depending on two columns from another table

Time:07-11

I have two tables in my Postgres database, nodes and egdes. I am trying to find the immediate neighbors for an arbitrary subset of node ids.

This is my schema:

CREATE TABLE nodes (
  id INT,
  prop_1 VARCHAR(64),
  prop_2 VARCHAR(64)
);

CREATE TABLE edges (
  source_id INT,
  target_id INT
);

With some sample data:


INSERT INTO nodes (id, prop_1, prop_2) VALUES 
(1, 'lorem', 'ipsum'),
(2, 'dolor', 'sit'),
(3, 'conseteteur', 'sadipiscing'),
(4, 'elitr', 'sed'),
(5, 'diam', 'nonumy');

INSERT INTO edges (source_id, target_id) VALUES 
(1, 2),
(1, 3),
(1, 4),
(3, 4),
(2, 4),
(4, 5);

Where source_id and target_id act as foreign keys referencing nodes.

Immediate neighbors are defined as nodes whose id appears at least once in the edges table, either under the source_id or the target_id column, and on the same row as one of my arbitrary ids.

I think I have got it almost working, but my query returns duplicates, and is most likely inefficient. Here is an SQL Fiddle with the above schema, where I am trying to find distinct immediate neighbors for the nodes with id of 1 and 5, using the following query:

WITH sources(id) AS (
    SELECT target_id FROM edges WHERE edges.source_id IN (1, 5)
), targets(id) AS (
    SELECT source_id FROM edges WHERE edges.target_id IN (1, 5)
) SELECT DISTINCT * FROM nodes, sources, targets WHERE nodes.id = sources.id OR nodes.id = targets.id

I would expect my query to return the following rows from nodes:

id prop_1 prop_2
2 dolor sit
3 conseteteur sadipiscing
4 elitr sed

As you can tell from the fiddle, I seem to be getting the right results, but they contain both duplicates and superfluous columns.

How can I return only the correct rows from nodes? Is my approach especially inefficient?

CodePudding user response:

Your query will return correct results if you select only from nodes:

SELECT DISTINCT nodes.* 
FROM ...

but, I would suggest either a proper join of the tables:

SELECT DISTINCT n.*
FROM nodes n INNER JOIN edges e
ON n.id = CASE WHEN e.source_id IN (1, 5) THEN e.target_id ELSE e.source_id END
WHERE e.source_id IN (1, 5) OR e.target_id IN (1, 5)
ORDER BY n.id;

or EXISTS:

SELECT n.*
FROM nodes n
WHERE EXISTS (
        SELECT *
        FROM edges e
        WHERE (e.source_id IN (1, 5) OR e.target_id IN (1, 5))
          AND n.id = CASE WHEN e.source_id IN (1, 5) THEN e.target_id ELSE e.source_id END
      )
ORDER BY n.id;

See the demo.

CodePudding user response:

A simple UNION can get rid of the duplicates, since the default behavior is UNION DISTINCT.

But, of course, the IN clause alone does that too, since it doesn't matter how many instances of a value are contained in the IN clause list, each outer node row can be in the result at most once.

SELECT n.*
  FROM nodes AS n
 WHERE id IN (
         SELECT target_id FROM edges WHERE edges.source_id IN (1, 5)
          UNION
         SELECT source_id FROM edges WHERE edges.target_id IN (1, 5)
       )
;

The fiddle

  • Related