I have a table notifications
which contain a payload
column of type jsonb
with a gin index on this column. The talbe currently contains 2,742,691 rows
The table looks something like this:
id | payload | created_at |
---|---|---|
1 | {"customer": {"email": "[email protected]", "externalId": 111 } | 2022-06-21 |
2 | {"customer": {"email": "[email protected]", "externalId": 222 } | 2022-06-20 |
3 | {"customer": {"email": "[email protected]", "externalId": 333 } | 2022-06-20 |
4 | {"customer": {"email": "[email protected]", "externalId": 444 } | 2022-04-14 |
5 | {"customer": {"email": "[email protected]", "externalId": 555 } | 2022-04-12 |
6 | {"customer": {"email": "[email protected]", "externalId": 666 } | 2022-06-10 |
7 | {"customer": {"email": "[email protected]", "externalId": 666 } | 2022-06-11 |
I am trying to query a list of email addresses that match the following condition:
- multiple rows for the same
email
address exist - one of those rows does have a different
externalId
than one of the previous ones created_at
is within the last month
For the example table contents, this should only return [email protected]
because
[email protected]
only appears once[email protected]
doesnt have a row that was created within the last month[email protected]
has multiple rows but all of them have the sameexternalId
What I was trying is using a LEFT JOIN LATERAL
like this:
select
n.payload -> 'customer' -> 'email'
from
notifications n
left join lateral (
select
n2.payload -> 'customer' ->> 'externalId' tid
from
notifications n2
where
n2.payload @> jsonb_build_object(
'customer',
jsonb_build_object('email', n.payload -> 'customer' -> 'email')
)
and not (n2.payload @> jsonb_build_object(
'customer',
jsonb_build_object('externalId', n.payload -> 'customer' -> 'externalId')
))
and n2.created_at > NOW() - INTERVAL '1 month'
limit
1
) sub on true
where
n.created_at > NOW() - INTERVAL '1 month'
and sub.tid is not null;
however, this is taking ages to run. The Query plan for this looks like https://explain.depesz.com/s/mriB
QUERY PLAN
Nested Loop (cost=0.17..53386349.38 rows=259398 width=32)
-> Index Scan using index_notifications_created_at on notifications n (cost=0.09..51931.08 rows=259398 width=514)
Index Cond: (created_at > (now() - '1 mon'::interval))
-> Subquery Scan on sub (cost=0.09..205.60 rows=1 width=0)
Filter: (sub.tid IS NOT NULL)
-> Limit (cost=0.09..205.60 rows=1 width=32)
-> Index Scan using index_notifications_created_at on notifications n2 (cost=0.09..53228.33 rows=259 width=32)
Index Cond: (created_at > (now() - '1 mon'::interval))
Filter: ((payload @> jsonb_build_object('customer', jsonb_build_object('email', ((n.payload -> 'customer'::text) -> 'email'::text)))) AND (NOT (payload @> jsonb_build_object('customer', jsonb_build_object('externalId', ((n.payload -> 'customer'::text) -> 'externalId'::text))))))
JIT:
Functions: 13
Options: Inlining true, Optimization true, Expressions true, Deforming true
Any pointers what I'm doing wrong here / how to optimize this?
CodePudding user response:
Here is my suggestion. It uses function array_unique
by @klin from this SO post.
select email from
(
select
payload -> 'customer' ->> 'email' email,
array_agg(payload -> 'customer' ->> 'externalId') externalid_arr
from notifications
where created_at >= current_date - interval 'P1M'
-- created_at is within the last month
group by email
having count(*) > 1
-- multiple rows for the same email address exist
) t
where array_length(array_unique(externalid_arr), 1) > 1;
-- one of those rows does have a different externalId than ...
CodePudding user response:
Simple nested queries will do the trick without any need for joining or functions:
SELECT email
FROM (
SELECT email, exid
FROM (
SELECT payload -> 'customer' -> 'email' AS email,
payload -> 'customer' -> 'externalId' AS exid
FROM notifications
WHERE created_at > CURRENT_DATE - INTERVAL '1 month' ) recent
GROUP BY 1, 2 ) emails
GROUP BY 1
HAVING count(*) > 1
This will use your index on created_at
and should thus be reasonable quick.
CodePudding user response:
Your main mistake is that when you've joined notifications
laterally you create 4 JSON objects per row in the table limited by condition created_at > NOW() - INTERVAL '1 month'
. This condition limits row count to 259398, and now your subquery need to create 259398 * 4 = 1 037 592 JSONS. And finally LIMIT
is used to get only 1 row when all rows in JOIN
are processed.
You should refactor your query.
You can use CTE to obtain email
, distinct external_id
count and maximum created_at
per email
value like this
WITH cte(email, ext_id_count, max_created_at) AS (
SELECT
DISTINCT payload -> 'customer' -> 'email',
COUNT(DISTINCT payload -> 'customer' -> 'externalId'),
MAX(created_at)
FROM notifications
GROUP BY payload -> 'customer' -> 'email'
)
SELECT email FROM cte
WHERE ext_id_count > 1 AND max_created_at > CURRENT_DATE - INTERVAL '1 month'
Please, check a demo