In a Postgres database, I've noticed I can add a compound index key on the same set of columns, but in different orders. E.g:
CREATE INDEX index_1 ON public.A USING btree (x, y); CREATE INDEX index_2 ON public.A USING btree (y, x);
Note the indexes only differ by the order of the columns indexed.
I would have expected Postgres to treat the column list as an unordered set.
Would this ability to add two similar indexes imply that the order of the conditions in the WHERE and ON clauses matters for retrieval speed?
CodePudding user response:
I would have expected Postgres to treat the column list as an unordered set.
There is a very simple reason why this is not the case: computer memory is not multi-dimensional.
Imagine you have 10 items, each with a unique combination of x
and y
. On paper, you could plot them on a grid, with x
on one axis, and y
on the other. To find all values for a particular x
, you could look along the appropriate row; for all values for a particular y
, look down the appropriate column. If you need three variables, it becomes harder to draw on paper, but you can visualise a cube; mathematicians will visualise higher dimensional "hypercubes".
Computer memory (and storage) doesn't work that way; it's just a linear list of records, one after another. So your two-element index isn't stored as a grid, it's stored as a sorted list. If the index is on (x, y)
, the basic arrangement looks like this:
- Sort all the records by
x
- Within each group that has the same
x
, sort those records byy
(The actual structure is a bit more complex than just a list, so that it's easy to jump to the right group, and you don't have to rewrite the whole index every time a low value of x
needs to be added, but I think the general principle is good enough for the current discussion.)
Now, if you want to find all elements with a particular x
, you can jump to that group of records, and return them all. But if you want to find all elements with a particular y
, you have to look through every group to see if any are there.
Would this ability to add two similar indexes imply that the order of the conditions in the WHERE and ON clauses matters for retrieval speed?
The order they appear in the query generally doesn't matter, because a query defines a logical result, not an order of execution. What matters is the order the DBMS wants to check them to return that data efficiently.
For instance, imagine we know that x
and y
are both evenly distributed from 0
to 99
. Given the clause WHERE x = 0 AND y = 42
, it doesn't matter which index we use:
- If we use an index on
(x, y)
, we jump to the bucket forx=0
and find items in that bucket withy=42
- If we use an index on
(y, x)
, we jump to the bucket fory=42
and find items in that bucket withx=0
But if want to process the clause WHERE x <> 0 AND y = 42
, it makes a huge difference:
- If we use an index on
(x, y)
, we have to look through the buckets forx=1
,x=2
, ...x=99
; then in each bucket, look for records withy=42
- If we use an index on
(y, x)
, we can jump straight to the bucket fory=42
, skip past the items forx=0
, and return the rest of the bucket
This is the kind of decision that the query planner in a DBMS will evaluate: which order of operations is most likely, based on available statistics about the data, to result in the fastest results.
While these query planners are incredibly sophisticated, you have knowledge about your data, and the ways you want to use it, that they don't. Even if you trust the query planner never to pick the wrong one, adding more indexes has a cost, in storage, and in extra processing on every INSERT, UPDATE, and DELETE.