Home > OS >  PostgreSQL select distinct JSONB poor performance
PostgreSQL select distinct JSONB poor performance

Time:11-14

Let's say I have the following PostgreSQL table called products which has millions of records:

CREATE TABLE IF NOT EXISTS mytable (
    id serial NOT NULL PRIMARY KEY,
    label VARCHAR(50) NOT NULL,
    info jsonb NOT NULL,
    created_at timestamp NOT NULL DEFAULT now()
);

I have a SELECT statement targeting this table with several JOIN's that is generating duplicate rows. This appears to be common: Why do multiple-table joins produce duplicate rows?

I know I can fix this issue using SELECT DISTINCT .... However, the query is taking several seconds, whereas the vanilla SELECT ... query takes milliseconds.

I presume this has to do with the info JSONB field, which can be very large. When I remove info from the DISTINCT calculation by using SELECT DISTINCT ON (id) ... then the query is much faster.

However, DISTINCT ON breaks some of my queries that use ORDER BY [non-id field] due to this condition:

SELECT DISTINCT ON expressions must match initial ORDER BY expressions

I've noticed I can fix the error by using subqueries (good example here):

SELECT * FROM (
    SELECT DISTINCT ON (id) ...
) ORDER BY [non-id field]

Two questions:

  • Is passing a JSONB field through DISTINCT a known performance problem? I want to make sure my theory is reasonable.
  • Is my solution of using subqueries a good solution for fixing the SELECT DISTINCT ON expressions must match initial ORDER BY expressions error? Or is there a better solution I'm not thinking of?

CodePudding user response:

Is passing a JSONB field through DISTINCT a known performance problem?

Passing any large column through DISTINCT is a known performance problem. jsonb isn't special in this regard.

Or is there a better solution I'm not thinking of?

That depends. You didn't provide requested information, yet.

CodePudding user response:

You can get by with a single sort and still have good performance:

SELECT DISTINCT ON ([non-id column], id) ...
ORDER BY [non-id column], id;
  • Related