Optimize updating first, last, and second to last ranked value

I need to cache the first, last, and second to last time a thing happened per user. The history table I'm querying has hundreds of millions of rows (we're caching so we can truncate it), and the table I'm updating has tens of millions.

Currently I'm doing it in batches of 1000 to avoid locking the tables. The query is like so:

with ranked as (
  select
      user_id,
      rank() over (partition by user_id order by created_at desc) as ranked_desc,
      rank() over (partition by user_id order by created_at asc) as ranked_asc,
      created_at
  from history
  where type = 'SomeType' and
        user_id between $1 and $2
)
update
  users u
set
  latest_at = (
    select created_at
    from ranked
    where ranked.ranked_desc = 1 and ranked.user_id = u.id
  ),
  previous_at = (
    select created_at
    from ranked
    where ranked.ranked_desc = 2 and ranked.user_id = u.id
  ),
  first_at = (
    select created_at
    from ranked
    where ranked.ranked_asc = 1 and ranked.user_id = u.id
  )
from ranked
where u.id = ranked.user_id

Relevant indexes on history are these. They are all btree indexes.

(created_at)
(user_id, created_at)
(user_id, type)
(type, created_at)

Can this be optimized? I feel this can be done without the subqueries.

Answer:

Since we have the all-important index on (user_id, created_at), I suggest:

UPDATE users u
SET    first_at    = h.first_at
     , latest_at   = h.latest_at
     , previous_at = h.previous_at
FROM  (
   SELECT u.id, f.first_at, l.last[1] AS latest_at, l.last[2] AS previous_at
   FROM   users u
   CROSS  JOIN LATERAL (
      SELECT ARRAY (
         SELECT h.created_at
         FROM   history h
         WHERE  h.user_id = u.id
         AND    h.type = 'SomeType'  -- ??
         ORDER  BY h.created_at DESC
         LIMIT  2
         ) AS last
      ) l
   CROSS  JOIN LATERAL (
      SELECT created_at AS first_at
      FROM   history h
      WHERE  h.user_id = u.id
      AND    h.type = 'SomeType'  -- ??
      ORDER  BY created_at
      LIMIT  1
      ) f
   WHERE  u.id BETWEEN $1 AND $2
   ) h
WHERE  u.id = h.id
AND   (u.first_at    IS DISTINCT FROM h.first_at
    OR u.latest_at   IS DISTINCT FROM h.latest_at
    OR u.previous_at IS DISTINCT FROM h.previous_at);

This works with non-unique timestamps per user_id, too.

And it's very efficient if there are many rows per user. It's designed to avoid a sequential scan on the big table and make heavy use of the index on (user_id, created_at) instead. Related:

Optimize GROUP BY query to retrieve latest row per user
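
To check that the planner really takes the index route, it can help to run one of the inner subqueries on its own for a single user. A minimal sketch, assuming a user id of 123 (a placeholder, pick one that exists in your data):

```sql
-- 123 is a placeholder user id; substitute a real one.
-- Note: EXPLAIN ANALYZE actually executes the query (cheap here, it is a LIMIT 2 SELECT).
EXPLAIN (ANALYZE, BUFFERS)
SELECT h.created_at
FROM   history h
WHERE  h.user_id = 123
AND    h.type = 'SomeType'
ORDER  BY h.created_at DESC
LIMIT  2;
```

If the plan shows a backward index scan on the (user_id, created_at) index rather than a sequential scan plus sort, the LATERAL subqueries should behave the same way for each user.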

Assuming most or all users get updated this way, we don't need an index on users. (For the purpose of this UPDATE, no index would be best.)

If there is only a single row in table history for a user, then previous_at is set to NULL. (Your original query has the same effect.)
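
The NULL comes from how Postgres handles array subscripts: reading past the end of an array yields NULL rather than raising an error. A small standalone illustration (the timestamp is made up for the example):

```sql
-- Subscripting past the end of an array yields NULL, not an error.
SELECT a[1] AS latest_at
     , a[2] AS previous_at   -- NULL: the array has only one element
FROM  (SELECT ARRAY[timestamptz '2024-01-01 12:00+00'] AS a) t;
```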

Only users with qualifying history rows are updated.
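
That follows from CROSS JOIN LATERAL: when the subquery for first_at returns no row, the user drops out of the join entirely. A self-contained sketch using VALUES lists as stand-ins for the real tables (the ids and the timestamp are invented for illustration):

```sql
-- Stand-in data: users 1 and 2, but only user 1 has a history row.
SELECT u.id, f.first_at
FROM  (VALUES (1), (2)) u(id)
CROSS  JOIN LATERAL (
   SELECT h.created_at AS first_at
   FROM  (VALUES (1, timestamptz '2024-01-01 12:00+00')) h(user_id, created_at)
   WHERE  h.user_id = u.id
   ORDER  BY h.created_at
   LIMIT  1
   ) f;
-- Returns a single row, for id 1. User 2 is eliminated by the CROSS JOIN.
```

If users without history should instead be reset to NULL, LEFT JOIN LATERAL ... ON true would keep them in the result. That is a variation, not what the query above does.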

This added WHERE clause skips updates that would not change anything (but would otherwise still be carried out at full cost):

AND   (u.first_at    IS DISTINCT FROM h.first_at
    OR u.latest_at   IS DISTINCT FROM h.latest_at
    OR u.previous_at IS DISTINCT FROM h.previous_at)

See:

How do I (or can I) SELECT DISTINCT on multiple columns?
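
IS DISTINCT FROM is used instead of <> because the cached columns can be NULL. With plain <>, a comparison involving NULL evaluates to NULL, so a user whose cached columns are all still NULL would never qualify for the update. A quick illustration:

```sql
SELECT NULL::timestamptz <> now()                           AS plain_ne      -- NULL
     , NULL::timestamptz IS DISTINCT FROM now()             AS null_safe_ne  -- true
     , NULL::timestamptz IS DISTINCT FROM NULL::timestamptz AS both_null;    -- false
```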

The only uncertainty is with WHERE type = 'SomeType'. If that predicate is selective, a partial index with the same predicate would be better. Then we could even get index-only scans ...
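
Such a partial index might look like the sketch below. The index name is hypothetical, and CONCURRENTLY is an assumption on my part: it avoids blocking writes on a table this large, but it cannot run inside a transaction block.

```sql
-- Hypothetical index name; the predicate must match the query's WHERE clause.
CREATE INDEX CONCURRENTLY history_sometype_user_created_idx
ON     history (user_id, created_at)
WHERE  type = 'SomeType';
```

Whether index-only scans actually happen also depends on the visibility map being reasonably current, so the table needs to have been vacuumed recently enough.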

Since the new query should be much faster, you might update more (or all) users at once.