Multiple-dimensional index on two float columns

Time:10-11

Suppose the following data:

user_id      timestamp              value
-----------------------------------------
2983         2022-01-01 12:01       100
2983         2022-01-01 12:04       106
2983         2022-01-01 12:06       101
2983         2022-01-01 12:10       110
2983         2022-01-01 12:11       112
2983         2022-01-01 12:13       115
2983         2022-01-01 12:15       102

I'm trying to perform queries such as:

select
  max(value) from table
where user_id = 2983
  and timestamp between '2022-01-01 12:10' and '2022-01-01 12:20'

There are billions of entries in this table, and the above query actually goes into a left join lateral subquery of the following form, with the event table itself containing hundreds of thousands of rows:

select
  event.user_id,
  event.timestamp_start,
  event.timestamp_end,
  value.max_value
from event
left join lateral (
    select max(value) as max_value
    from table
    where table.user_id = event.user_id
      and table.timestamp between event.timestamp_start and event.timestamp_end
) value on true

I'm thinking that, conceptually, a two-dimensional index on (timestamp, value) would help. The problem with a plain (timestamp, value) index is that it only orders value within each individual timestamp, so it is not a true two-dimensional index. Add the need to index on user_id as well, and I'm a bit stuck on which index type to use, or whether PostgreSQL has such a capability at all.

What index type would be best for the case illustrated here?

CodePudding user response:

The index that comes to mind is either

create index idx on table (user_id, timestamp, value desc);

or

create index idx on table (timestamp, user_id, value desc);

In some DBMSs it is recommended to put the more selective column first; in others the column order is said not to matter. I don't know how PostgreSQL behaves here. You can create both indexes and look at the explain plan to see which one is used and which is not.
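A minimal sketch of that comparison follows. The table name measurement, the column types, and the index names are assumptions made for illustration (the question's placeholder name table is a reserved word in PostgreSQL, so it cannot be used unquoted).

```sql
-- Hypothetical stand-in for the question's "table"; column types are assumed.
create table if not exists measurement (
    user_id   bigint           not null,
    timestamp timestamp        not null,
    value     double precision
);

-- Both column orders from the answer, under distinct (made-up) index names.
create index if not exists measurement_user_ts_value
    on measurement (user_id, timestamp, value desc);
create index if not exists measurement_ts_user_value
    on measurement (timestamp, user_id, value desc);

-- Refresh planner statistics so the estimates reflect the actual data.
analyze measurement;

-- ANALYZE runs the query for real; BUFFERS reports how many pages were read.
explain (analyze, buffers)
select max(value)
from measurement
where user_id = 2983
  and timestamp between '2022-01-01 12:10' and '2022-01-01 12:20';
```

The index name that appears after "Index Scan using" or "Index Only Scan using" in the plan shows which of the two the planner preferred. The comparison is only meaningful on a table with realistic data volumes.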

As you are looking for a range of timestamps rather than a single one, it may or may not be preferable to simply include the value column. In theory the indexes above should be better, as the DBMS can quickly pick the maximum value per timestamp and then take the maximum of those. But I don't know whether PostgreSQL uses this access method. If not, a simpler index may (a little surprisingly) be the better option:

create index idx on table (user_id, timestamp) include (value);

or

create index idx on table (timestamp, user_id) include (value);

Again, this is something to try and check.
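One way to check the covering variant, as a sketch: the table name measurement, the column types, and the index name are assumptions, and the include clause needs PostgreSQL 11 or later.

```sql
-- Hypothetical stand-in for the question's "table"; column types are assumed.
create table if not exists measurement (
    user_id   bigint           not null,
    timestamp timestamp        not null,
    value     double precision
);

-- Key columns drive the search; value is stored in the index as payload only.
create index if not exists measurement_user_ts_incl_value
    on measurement (user_id, timestamp) include (value);

-- Index-only scans depend on the visibility map, which VACUUM maintains.
vacuum (analyze) measurement;

explain (analyze, buffers)
select max(value)
from measurement
where user_id = 2983
  and timestamp between '2022-01-01 12:10' and '2022-01-01 12:20';
```

If the plan shows an "Index Only Scan" node with a low "Heap Fetches" count, the query is being answered from the index without visiting the table.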

CodePudding user response:

PostgreSQL supports GiST indexes, which are multidimensional in the way you want. But I doubt that is the real solution for you, because GiST indexes carry a lot of overhead and would require you to write your query in an awkward way in this case. You are probably better off with a B-tree index on all three columns, which might be theoretically less efficient but has much lower overhead.

The best B-tree index should be either (user_id, timestamp, value) or (user_id, value, timestamp), depending on how selective the timestamp range is likely to be. If the range is selective, you want the former; if it is unselective, the latter. By including all three columns in the index, you can get an index-only scan, so that you don't have to visit a bunch of table rows. This is probably the key to getting good performance.
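To illustrate why the (user_id, value, timestamp) order suits an unselective range: the index can be read from the highest value downward for one user, stopping at the first row whose timestamp falls in the range. The sketch below spells that out with order by ... limit 1. The table names measurement and event, the aliases, the column types, and the index name are assumptions; PostgreSQL may plan max() in a similar way on its own, so this form is mainly useful for experimenting.

```sql
-- Hypothetical stand-ins for the question's "table" and "event" tables.
create table if not exists measurement (
    user_id   bigint           not null,
    timestamp timestamp        not null,
    value     double precision
);
create table if not exists event (
    user_id         bigint    not null,
    timestamp_start timestamp not null,
    timestamp_end   timestamp not null
);

-- Second ordering from the answer: values are sorted within each user.
create index if not exists measurement_user_value_ts
    on measurement (user_id, value, timestamp);

select
  e.user_id,
  e.timestamp_start,
  e.timestamp_end,
  v.max_value
from event e
left join lateral (
    select m.value as max_value
    from measurement m
    where m.user_id = e.user_id
      and m.timestamp between e.timestamp_start and e.timestamp_end
      and m.value is not null   -- max() ignores NULLs; DESC would sort them first
    order by m.value desc
    limit 1
) v on true;
```

When no row matches, the lateral subquery returns nothing and the left join yields a NULL max_value, which matches what max() would return.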

I started creating a GiST index on a 1e7 row dummy table so that I could test how exactly you could use it, but it hasn't finished building yet and I am bored. Like I said, high overhead.
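For anyone who wants to repeat that experiment, here is one way it might look. The answer does not say how the GiST index was defined; using the btree_gist extension to get GiST operator classes for these scalar types is an assumption, as are the table name, index names, and the shape of the generated data.

```sql
-- btree_gist supplies GiST operator classes for bigint, timestamp and float8.
create extension if not exists btree_gist;

-- 1e7 rows of made-up data: about 1000 users, one reading per second.
create table dummy_measurement as
select (random() * 1000)::bigint                        as user_id,
       timestamp '2022-01-01' + g * interval '1 second' as timestamp,
       random() * 1000                                  as value
from generate_series(1, 10000000) as g;

-- In psql, "\timing on" prints the elapsed time of each statement.
create index dummy_btree on dummy_measurement (user_id, timestamp, value);
create index dummy_gist  on dummy_measurement using gist (user_id, timestamp, value);

-- Compare the on-disk size of the two indexes.
select pg_size_pretty(pg_relation_size('dummy_btree')) as btree_size,
       pg_size_pretty(pg_relation_size('dummy_gist'))  as gist_size;
```

Comparing the two build times and sizes gives a rough measure of the overhead described above.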
