Let's say I have the following PostgreSQL table called products
which has millions of records:
CREATE TABLE IF NOT EXISTS products (
    id serial NOT NULL PRIMARY KEY,
    label VARCHAR(50) NOT NULL,
    info jsonb NOT NULL,
    created_at timestamp NOT NULL DEFAULT now()
);
I have a SELECT statement targeting this table with several JOINs that is generating duplicate rows. This appears to be common: Why do multiple-table joins produce duplicate rows?
I know I can fix this issue using SELECT DISTINCT. However, that query takes several seconds, whereas the vanilla SELECT query takes milliseconds.
I presume this has to do with the info JSONB field, which can be very large. When I remove info from the DISTINCT calculation by using SELECT DISTINCT ON (id), the query is much faster.
However, DISTINCT ON breaks some of my queries that use ORDER BY [non-id field], because of this error:
SELECT DISTINCT ON expressions must match initial ORDER BY expressions
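For example, a query of this shape (ordering by the created_at column from the table above) triggers it:
SELECT DISTINCT ON (id) *
FROM products
ORDER BY created_at;  -- ERROR: SELECT DISTINCT ON expressions must match initial ORDER BY expressions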
I've noticed I can fix the error by using subqueries (good example here):
SELECT * FROM (
    SELECT DISTINCT ON (id) ...
) AS sub
ORDER BY [non-id field]
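As a concrete sketch of that pattern, assuming the duplicates come from a hypothetical one-to-many join table named order_items and that I sort by created_at:
SELECT *
FROM (
    SELECT DISTINCT ON (p.id) p.*
    FROM products p
    JOIN order_items oi ON oi.product_id = p.id  -- hypothetical join that multiplies rows
) AS deduped
ORDER BY created_at DESC;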
Two questions:
- Is passing a JSONB field through DISTINCT a known performance problem? I want to make sure my theory is reasonable.
- Is my solution of using subqueries a good solution for fixing the "SELECT DISTINCT ON expressions must match initial ORDER BY expressions" error? Or is there a better solution I'm not thinking of?
CodePudding user response:
Is passing a JSONB field through DISTINCT a known performance problem?

Passing any large column through DISTINCT is a known performance problem. jsonb isn't special in this regard.
Or is there a better solution I'm not thinking of?

That depends. You didn't provide the requested information yet.
CodePudding user response:
You can get by with a single sort and still have good performance:
SELECT DISTINCT ON ([non-id column], id) ...
ORDER BY [non-id column], id;
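Applied to the example table, and assuming the same hypothetical order_items join and an ordering on created_at, a minimal sketch looks like this:
SELECT DISTINCT ON (p.created_at, p.id) p.*
FROM products p
JOIN order_items oi ON oi.product_id = p.id  -- hypothetical join that multiplies rows
ORDER BY p.created_at, p.id;
Because the DISTINCT ON expressions match the leading ORDER BY expressions, PostgreSQL can deduplicate and order the result with a single sort.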