Home > Software design >  Select rows with jsonb column that have different values in one jsonb attribute and matching values
Select rows with jsonb column that have different values in one jsonb attribute and matching values

Time:06-23

I have a table notifications which contain a payload column of type jsonb with a gin index on this column. The talbe currently contains 2,742,691 rows

The table looks something like this:

id payload created_at
1 {"customer": {"email": "[email protected]", "externalId": 111 } 2022-06-21
2 {"customer": {"email": "[email protected]", "externalId": 222 } 2022-06-20
3 {"customer": {"email": "[email protected]", "externalId": 333 } 2022-06-20
4 {"customer": {"email": "[email protected]", "externalId": 444 } 2022-04-14
5 {"customer": {"email": "[email protected]", "externalId": 555 } 2022-04-12
6 {"customer": {"email": "[email protected]", "externalId": 666 } 2022-06-10
7 {"customer": {"email": "[email protected]", "externalId": 666 } 2022-06-11

I am trying to query a list of email addresses that match the following condition:

  • multiple rows for the same email address exist
  • one of those rows does have a different externalId than one of the previous ones
  • created_at is within the last month

For the example table contents, this should only return [email protected] because

What I was trying is using a LEFT JOIN LATERAL like this:

select
  n.payload -> 'customer' -> 'email'
from
  notifications n
  left join lateral (
    select
      n2.payload -> 'customer' ->> 'externalId' tid
    from
      notifications n2
    where
      n2.payload @> jsonb_build_object(
        'customer',
        jsonb_build_object('email', n.payload -> 'customer' -> 'email')
      )
      and not (n2.payload @> jsonb_build_object(
        'customer',
        jsonb_build_object('externalId', n.payload -> 'customer' -> 'externalId')
      ))
      and n2.created_at > NOW() - INTERVAL '1 month' 
    limit
      1
  ) sub on true
where
  n.created_at > NOW() - INTERVAL '1 month'
  and sub.tid is not null;

however, this is taking ages to run. The Query plan for this looks like https://explain.depesz.com/s/mriB

QUERY PLAN
Nested Loop  (cost=0.17..53386349.38 rows=259398 width=32)
  ->  Index Scan using index_notifications_created_at on notifications n  (cost=0.09..51931.08 rows=259398 width=514)
        Index Cond: (created_at > (now() - '1 mon'::interval))
  ->  Subquery Scan on sub  (cost=0.09..205.60 rows=1 width=0)
        Filter: (sub.tid IS NOT NULL)
        ->  Limit  (cost=0.09..205.60 rows=1 width=32)
              ->  Index Scan using index_notifications_created_at on notifications n2  (cost=0.09..53228.33 rows=259 width=32)
                    Index Cond: (created_at > (now() - '1 mon'::interval))
                    Filter: ((payload @> jsonb_build_object('customer', jsonb_build_object('email', ((n.payload -> 'customer'::text) -> 'email'::text)))) AND (NOT (payload @> jsonb_build_object('customer', jsonb_build_object('externalId', ((n.payload -> 'customer'::text) -> 'externalId'::text))))))
JIT:
  Functions: 13
  Options: Inlining true, Optimization true, Expressions true, Deforming true

Any pointers what I'm doing wrong here / how to optimize this?

CodePudding user response:

Here is my suggestion. It uses function array_unique by @klin from this SO post.

select email from
(
  select 
         payload -> 'customer' ->> 'email' email, 
         array_agg(payload -> 'customer' ->> 'externalId') externalid_arr
  from notifications 
  where created_at >= current_date - interval 'P1M'
  -- created_at is within the last month
  group by email
  having count(*) > 1 
  -- multiple rows for the same email address exist
) t 
where array_length(array_unique(externalid_arr), 1) > 1; 
-- one of those rows does have a different externalId than ...

CodePudding user response:

Simple nested queries will do the trick without any need for joining or functions:

SELECT email
FROM (
    SELECT email, exid
    FROM (
        SELECT payload -> 'customer' -> 'email' AS email,
               payload -> 'customer' -> 'externalId' AS exid
        FROM notifications
        WHERE created_at > CURRENT_DATE - INTERVAL '1 month' ) recent 
    GROUP BY 1, 2 ) emails
GROUP BY 1
HAVING count(*) > 1

This will use your index on created_at and should thus be reasonable quick.

CodePudding user response:

Your main mistake is that when you've joined notifications laterally you create 4 JSON objects per row in the table limited by condition created_at > NOW() - INTERVAL '1 month'. This condition limits row count to 259398, and now your subquery need to create 259398 * 4 = 1 037 592 JSONS. And finally LIMIT is used to get only 1 row when all rows in JOIN are processed.

You should refactor your query.

You can use CTE to obtain email, distinct external_id count and maximum created_at per email value like this

WITH cte(email, ext_id_count, max_created_at) AS (
    SELECT
        DISTINCT payload -> 'customer' -> 'email', 
        COUNT(DISTINCT payload -> 'customer' -> 'externalId'), 
        MAX(created_at)
    FROM notifications
    GROUP BY payload -> 'customer' -> 'email'
)
SELECT email FROM cte 
WHERE ext_id_count > 1 AND max_created_at > CURRENT_DATE - INTERVAL '1 month'

Please, check a demo

  • Related