Home > Software design >  Use DISTINCT to display unique items
Use DISTINCT to display unique items

Time:10-13

I have 2 tables which I would like to query and display data differences:

CREATE TABLE order_splits_config (
    id                   INT,
    pair_id              INT
);

CREATE TABLE active_pairs (
    id                  INT,
    pair                VARCHAR(30),
    exchange_active     boolean,
    exchange_id         INT
);

INSERT INTO order_splits_config(id, pair_id)
VALUES (1, 83);
INSERT INTO order_splits_config(id, pair_id)
VALUES (2, 58);
INSERT INTO order_splits_config(id, pair_id)
VALUES (34, 34);

INSERT INTO active_pairs(id, pair, exchange_active, exchange_id)
VALUES (1, 'US/EN', true, 2);
INSERT INTO active_pairs(id, pair, exchange_active, exchange_id)
VALUES (2, 'GB/UK', true, 3);
INSERT INTO active_pairs(id, pair, exchange_active, exchange_id)
VALUES (2, 'FR/EU', true, 4);

I use this query to query the differences:

SELECT b.id, b.pair, b.exchange_id
FROM order_splits_config a
FULL OUTER JOIN active_pairs b
ON a.pair_id = b.id
WHERE a.pair_id IS NULL
OR b.id IS NULL
AND b.exchange_active = 'true';

This prints a lot of lines like this for example (data is just for example):

#,pair,id,exchange_id
1, US/EN,332,1
2, GB/UK,112,1
3, GB/UK,113,1
4, FR/EU,221,5
5, FR/EU,183,2
...

How can I use DISTINCT in order to get the result from the query unique by pair?

FIDDLE: https://www.db-fiddle.com/f/4D6VfqysPCWhQnh8zaFBps/2

CodePudding user response:

You only select columns from table active_pairs to begin with.

After joining on a.pair_id = b.id, the condition a.pair_id IS NULL is only true if there is no matching row in table order_splits_config AS a, which is more efficiently formulated as NOT EXISTS instead of the FULL JOIN.

As for b.id IS NULL AND b.exchange_active = 'true': again, b.id can only be NULL for rows of a where no matching b is found - in which case b.exchange_active can never be true. So the clause effectively eliminates all rows of a that were just added by the FULL JOIN and which would show up in the result as all NULL values since there are only columns from b. In short: don't FULL JOIN to begin with.

If active_pairs.pair is defined UNIQUE, it boils down to:

SELECT id, pair, exchange_id
FROM   active_pairs b
WHERE  NOT EXISTS (SELECT FROM order_splits_config a WHERE a.pair_id = b.id);

If active_pairs.pair is not defined UNIQUE, the simple solution is with DISTINCT ON:

SELECT DISTINCT ON (pair)
       id, pair, exchange_id
FROM   active_pairs b
WHERE  NOT EXISTS (SELECT FROM order_splits_config a WHERE a.pair_id = b.id);

db<>fiddle here

From each set of duplicates on pair, you get an arbitrary pick.
For a deterministic pick, define what to pick and add an ORDER BY clause accordingly. Example: to get the one with the smallest id, add:

...
ORDER BY pair, id;

Also adds a sort order to the so far unsorted result.

If order_splits_config isn't trivially small, have an index on (pair_id) to make this fast.

There may be much faster solutions for a big active_pairs table, depending on undisclosed information, mostly the cardinality of column pair (how many duplicate values).

See:

  • Related