Suppose the following data:
user_id  timestamp         value
--------------------------------
2983     2022-01-01 12:01  100
2983     2022-01-01 12:04  106
2983     2022-01-01 12:06  101
2983     2022-01-01 12:10  110
2983     2022-01-01 12:11  112
2983     2022-01-01 12:13  115
2983     2022-01-01 12:15  102
I'm trying to perform queries such as:
select max(value)
from table
where user_id = 2983
  and timestamp between '2022-01-01 12:10' and '2022-01-01 12:20'
On the sample data above, this returns 115 (the maximum over the four rows from 12:10 through 12:15).
There are billions of entries in this table, and the above query actually goes into a left join lateral
subquery of the following form, with the event
table itself containing hundreds of thousands of rows:
select
  event.user_id,
  event.timestamp_start,
  event.timestamp_end,
  value.max_value
from event
left join lateral (
  select max(value) as max_value
  from table
  where table.user_id = event.user_id
    and table.timestamp between event.timestamp_start and event.timestamp_end
) value on true
I'm thinking that, conceptually, a two-dimensional index on (timestamp, value) would help. The problem with a basic (timestamp, value) index is that it only orders value within a given timestamp; it doesn't reflect a true two-dimensional index. Add to that the additional complexity of the required indexing on user_id, and I'm a bit stuck about which index form to use, let alone whether PostgreSQL has such a capability at all.
Which index form would be optimal for the case illustrated here?
CodePudding user response:
The index that comes to mind is either
create index idx on table (user_id, timestamp, value desc);
or
create index idx on table (timestamp, user_id, value desc);
In some DBMSs it is recommended to place the column with the higher selectivity first; in others it is supposed not to matter. I don't know how it is in PostgreSQL. You can create both indexes and check the explain plan to see which is used and which is not.
As you are not looking for a single timestamp but for a range, it may or may not be preferable to simply include the value column. In theory the above indexes should be better, as the DBMS can quickly pick the maximum value per timestamp and then pick the maximum of all of these. But I don't know whether PostgreSQL uses this access method. If not, then a simpler index may (a little surprisingly) be the better option:
create index idx on table (user_id, timestamp) include (value);
or
create index idx on table (timestamp, user_id) include (value);
Again something to try and check.
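One way to run that check is to create the candidate indexes and compare plans. A sketch, assuming the table and column names from the question (the index names here are illustrative, and a real table would need a name other than the reserved word "table"):

```sql
-- Candidate indexes (names are illustrative)
create index idx_user_ts_val on "table" (user_id, timestamp, value desc);
create index idx_user_ts_inc on "table" (user_id, timestamp) include (value);

-- Check which index the planner picks and how much work it does
explain (analyze, buffers)
select max(value)
from "table"
where user_id = 2983
  and timestamp between '2022-01-01 12:10' and '2022-01-01 12:20';
```

Comparing the buffer counts and timings between runs (dropping one index at a time) shows which shape the planner actually benefits from.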
CodePudding user response:
PostgreSQL supports GiST indexes, which are multidimensional in the way you want. But I doubt that that is the real solution for you, because GiST indexes have a lot of overhead and would require you to write your query in an awkward way in this case. You are probably better off with a Btree index including all three columns, which might be theoretically less efficient but has a much smaller overhead.
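For completeness: a GiST index over plain scalar columns like these requires the btree_gist extension. A sketch, assuming that extension is available:

```sql
create extension if not exists btree_gist;

-- Multicolumn GiST index; unlike a B-tree, it does not keep entries
-- in a total order, so it cannot serve max(value) as a simple
-- "walk to the end of the index" operation
create index idx_gist on "table" using gist (user_id, timestamp, value);
```

This is the awkwardness mentioned above: the query would have to be phrased so the planner can use the GiST operators, and the aggregate still can't be answered from index order alone.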
The best Btree index should be either (user_id, timestamp, value) or (user_id, value, timestamp), depending on how selective the timestamp range is likely to be: if the range is selective, you want the former; if it is unselective, the latter. By including all three columns in the index, you can get an index-only scan, so you don't have to visit a bunch of table rows. This is probably the key to getting good performance.
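Whether you actually get an index-only scan also depends on the visibility map being up to date, so after a bulk load it can help to vacuum. A sketch, again using the (reserved-word) names from the question:

```sql
create index idx_covering on "table" (user_id, timestamp, value);

-- Keep the visibility map current so the planner can choose an
-- index-only scan instead of visiting heap pages
vacuum analyze "table";

-- The plan should report an "Index Only Scan" node if it worked
explain
select max(value)
from "table"
where user_id = 2983
  and timestamp between '2022-01-01 12:10' and '2022-01-01 12:20';
```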
I started creating a GiST index on a 1e7 row dummy table so that I could test how exactly you could use it, but it hasn't finished building yet and I am bored. Like I said, high overhead.