I need to cache the first, last, and second to last time a thing happened per user. The history table I'm querying has hundreds of millions of rows (we're caching so we can truncate it), and the table I'm updating has dozens of millions.
Currently I'm doing it in batches of 1000 to avoid locking the tables. The query is like so:
with ranked as (
select
user_id,
rank() over (partition by user_id order by created_at desc) as ranked_desc,
rank() over (partition by user_id order by created_at asc) as ranked_asc,
created_at
from history
where type = 'SomeType' and
user_id between $1 and $2
)
update
users u
set
latest_at = (
select created_at
from ranked
where ranked.ranked_desc = 1 and ranked.user_id = u.id
),
previous_at = (
select created_at
from ranked
where ranked.ranked_desc = 2 and ranked.user_id = u.id
),
first_at = (
select created_at
from ranked
where ranked.ranked_asc = 1 and ranked.user_id = u.id
)
from ranked
where u.id = ranked.user_id
Relevant indexes on history are these. They are all btree indexes.
- (created_at)
- (user_id, created_at)
- (user_id, type)
- (type, created_at)
Can this be optimized? I feel this can be done without the subqueries.
CodePudding user response:
Since we have the all-important index on (user_id, created_at)
, I suggest:
UPDATE users u
SET first_at = h.first_at
, latest_at = h.latest_at
, previous_at = h.previous_at
FROM (
SELECT u.id, f.first_at, l.last[1] AS latest_at, l.last[2] AS previous_at
FROM users u
CROSS JOIN LATERAL (
SELECT ARRAY (
SELECT h.created_at
FROM history h
WHERE h.user_id = u.id
AND h.type = 'SomeType' -- ??
ORDER BY h.created_at DESC
LIMIT 2
) AS last
) l
CROSS JOIN LATERAL (
SELECT created_at AS first_at
FROM history h
WHERE h.user_id = u.id
AND h.type = 'SomeType' -- ??
ORDER BY created_at
LIMIT 1
) f
WHERE u.id BETWEEN $1 AND $2
) h
WHERE u.id = h.id
AND (u.first_at IS DISTINCT FROM h.first_at
OR u.latest_at IS DISTINCT FROM h.latest_at
OR u.previous_at IS DISTINCT FROM h.previous_at);
This works with non-unique timestamps per user_id
, too.
And it's very efficient if there are many rows per user. It's designed to avoid a sequential scan on the big table and make heavy use of the index on (user_id, created_at)
instead.
Related:
Assuming most or all users get updated this way, we don't need an index on users
. (For the purpose of this UPDATE
, no index would be best.)
If there is only a single row in table history
for a user, then previous_at
is set to NULL
. (Your original query has the same effect.)
Only users are updated where qualifying history rows are found.
This added WHERE
clause skips updates that would not change anything (at full cost):
AND (u.first_at IS DISTINCT FROM h.first_at
OR u.latest_at IS DISTINCT FROM h.latest_at
OR u.previous_at IS DISTINCT FROM h.previous_at)
See:
The only insecurity is with WHERE type = 'SomeType'
. If that's selective, a partial index with the same predicate would be better. Then we could even get index-only scans ...
Since the new query should be much faster, you might update more (or all) users at once.