Select rows with jsonb column that have different values in one jsonb attribute and matching values-CodePudding

I have a table notifications which contain a payload column of type jsonb with a gin index on this column. The talbe currently contains 2,742,691 rows

The table looks something like this:

id	payload	created_at
1	{"customer": {"email": "[email protected]", "externalId": 111 }	2022-06-21
2	{"customer": {"email": "[email protected]", "externalId": 222 }	2022-06-20
3	{"customer": {"email": "[email protected]", "externalId": 333 }	2022-06-20
4	{"customer": {"email": "[email protected]", "externalId": 444 }	2022-04-14
5	{"customer": {"email": "[email protected]", "externalId": 555 }	2022-04-12
6	{"customer": {"email": "[email protected]", "externalId": 666 }	2022-06-10
7	{"customer": {"email": "[email protected]", "externalId": 666 }	2022-06-11

I am trying to query a list of email addresses that match the following condition:

multiple rows for the same email address exist
one of those rows does have a different externalId than one of the previous ones
created_at is within the last month

For the example table contents, this should only return [email protected] because

[email protected] only appears once
[email protected] doesnt have a row that was created within the last month
[email protected] has multiple rows but all of them have the same externalId

What I was trying is using a LEFT JOIN LATERAL like this:

select
  n.payload -> 'customer' -> 'email'
from
  notifications n
  left join lateral (
    select
      n2.payload -> 'customer' ->> 'externalId' tid
    from
      notifications n2
    where
      n2.payload @> jsonb_build_object(
        'customer',
        jsonb_build_object('email', n.payload -> 'customer' -> 'email')
      )
      and not (n2.payload @> jsonb_build_object(
        'customer',
        jsonb_build_object('externalId', n.payload -> 'customer' -> 'externalId')
      ))
      and n2.created_at > NOW() - INTERVAL '1 month' 
    limit
      1
  ) sub on true
where
  n.created_at > NOW() - INTERVAL '1 month'
  and sub.tid is not null;

however, this is taking ages to run. The Query plan for this looks like https://explain.depesz.com/s/mriB

QUERY PLAN
Nested Loop  (cost=0.17..53386349.38 rows=259398 width=32)
  ->  Index Scan using index_notifications_created_at on notifications n  (cost=0.09..51931.08 rows=259398 width=514)
        Index Cond: (created_at > (now() - '1 mon'::interval))
  ->  Subquery Scan on sub  (cost=0.09..205.60 rows=1 width=0)
        Filter: (sub.tid IS NOT NULL)
        ->  Limit  (cost=0.09..205.60 rows=1 width=32)
              ->  Index Scan using index_notifications_created_at on notifications n2  (cost=0.09..53228.33 rows=259 width=32)
                    Index Cond: (created_at > (now() - '1 mon'::interval))
                    Filter: ((payload @> jsonb_build_object('customer', jsonb_build_object('email', ((n.payload -> 'customer'::text) -> 'email'::text)))) AND (NOT (payload @> jsonb_build_object('customer', jsonb_build_object('externalId', ((n.payload -> 'customer'::text) -> 'externalId'::text))))))
JIT:
  Functions: 13
  Options: Inlining true, Optimization true, Expressions true, Deforming true

Any pointers what I'm doing wrong here / how to optimize this?

CodePudding user response：

Here is my suggestion. It uses function array_unique by @klin from this SO post.

select email from
(
  select 
         payload -> 'customer' ->> 'email' email, 
         array_agg(payload -> 'customer' ->> 'externalId') externalid_arr
  from notifications 
  where created_at >= current_date - interval 'P1M'
  -- created_at is within the last month
  group by email
  having count(*) > 1 
  -- multiple rows for the same email address exist
) t 
where array_length(array_unique(externalid_arr), 1) > 1; 
-- one of those rows does have a different externalId than ...

CodePudding user response：

Simple nested queries will do the trick without any need for joining or functions:

SELECT email
FROM (
    SELECT email, exid
    FROM (
        SELECT payload -> 'customer' -> 'email' AS email,
               payload -> 'customer' -> 'externalId' AS exid
        FROM notifications
        WHERE created_at > CURRENT_DATE - INTERVAL '1 month' ) recent 
    GROUP BY 1, 2 ) emails
GROUP BY 1
HAVING count(*) > 1

This will use your index on created_at and should thus be reasonable quick.

CodePudding user response：

Your main mistake is that when you've joined notifications laterally you create 4 JSON objects per row in the table limited by condition created_at > NOW() - INTERVAL '1 month'. This condition limits row count to 259398, and now your subquery need to create 259398 * 4 = 1 037 592 JSONS. And finally LIMIT is used to get only 1 row when all rows in JOIN are processed.

You should refactor your query.

You can use CTE to obtain email, distinct external_id count and maximum created_at per email value like this

WITH cte(email, ext_id_count, max_created_at) AS (
    SELECT
        DISTINCT payload -> 'customer' -> 'email', 
        COUNT(DISTINCT payload -> 'customer' -> 'externalId'), 
        MAX(created_at)
    FROM notifications
    GROUP BY payload -> 'customer' -> 'email'
)
SELECT email FROM cte 
WHERE ext_id_count > 1 AND max_created_at > CURRENT_DATE - INTERVAL '1 month'

Please, check a demo