Home > Software engineering >  What's the best way performance-wise to SELECT from a PARTITIONed table in MySQL: explicit part
What's the best way performance-wise to SELECT from a PARTITIONed table in MySQL: explicit part

Time:09-26

Say you have:

CREATE TABLE  demo (
    amount ,
    year ,
    cycle ,
    otherStuff ,
    PRIMARY KEY ( id , year , cycle )
) ENGINE = INNODB
PARTITION BY RANGE ( year )
SUBPARTITION BY KEY ( cycle ) 
SUBPARTITIONS 12 (
    PARTITION p2020 VALUES LESS THAN (2021) ,
    PARTITION p2021 VALUES LESS THAN (2022) ,
    PARTITION p2022 VALUES LESS THAN (2023) ,
    PARTITION pmax VALUES LESS THAN MAXVALUE
);

What's the best SELECT to run on that table?

A:

SELECT otherStuff FROM demo WHERE amount > 10 AND year = 2022 AND cycle = 1;

B:

SELECT otherStuff FROM demo (p2022, p1) WHERE amount > 10;

or

C:

SELECT otherStuff FROM demo (p2022, p1) WHERE amount > 10 AND year = 2022 AND cycle = 1;

I'm sure that there is some extra overhead in pruning—some preliminary step for the storage engine to take to figure out which partitions match the WHERE clause. But, where only one partition and subpartition match the WHERE clause and the pruning WHERE clause contains only simple equals comparisons, what I'm trying to figure out is whether the extra overhead is nominal for performance. The reason I want to figure that out is because I want to know if I can get away with pruning, which offers an advantage in design: if I ever wanted to, I could get rid of my partitions and have no queries to change. In other words, explicit partition selection introduces a dependency I'd rather avoid.

Thanks.

CodePudding user response:

None of the above. That is, "A", but without any partitioning.

Get rid of partitioning unless you can show some use for it.

Only in certain applications does PARTITON help with performance. I have never found a performance use for SUBPARTITION.

WHERE amount > 10 AND year = 2022 AND cycle = 1

That is best handled by

INDEX(year, cycle,  -- in either order
      amount)       -- put 'range' after '='

Partitioning would not help this query.

Time Series

A "time series" can be stored in a partitioned table where each partition is a week or month (or other time range). However, the only advantage comes when you get ready to Delete or Archive "old" rows.

DROP PARTITION is much faster and less invasive than the equivalent DELETE. However, it assumes that the oldest "week" can be jettisoned entirely.

Meanwhile, there is no performance benefit to SELECTs. Think of it this way. Partition pruning will pick (perhaps) one partition to look in, then the index takes over. But pruning is not "free". Nor is walking down a BTree. The BTree might be one level shallower because the partitioning serves for one level of "tree". But that just implies that the SELECT is trading off one search mechanism for another -- possibly without any performance change.

More on time series and how to Partition for such: http://mysql.rjweb.org/doc.php/partitionmaint That also covers how to efficiently create the 'next' partition as time goes on.

If you don't want to DROP the old partition, but want to "archive" it, then partitioning facilitates "transportable tablespaces", where the partition is removed from the main table and turned into a table by itself. Then that can be 'transported' to somewhere else. Again, that only applies to a complete partition, hence the rows being moved must align with the PARTITION BY ... being used.

Other uses for Partitioning

See the above link; I have found only 4 other cases; they are more obscure than Time Series.

Covering indexes

Indexing is too complex to make many general statements. If the covering index has two columns that are both being tested with a range (eg, BETWEEN), the query is destined to be inefficient. Essentially a BTree index can deal with only one range. This leads to a rarely seen use for Partitioning -- use partition pruning for one "range" and an Index for the other.

Finding "nearby" places on a globe can use that two-dimensional lookup with PARTITION BY RANGE(latitude) with longitude in the index.

I don't see this trick being viable beyond 2 ranges.

Back to "covering"... If the WHERE clause using a covering index has multiple ranges, there are still performance issues.

Another thing about "covering" indexes -- sometimes they are unwieldy because of having "too many" columns. I use the Rule of Thumb that says "don't put more than 5 columns in an INDEX". (This is a very soft rule; there is nothing magical about "5".)

Optimal index(es)

We can discuss one query at a time, but that is not sufficient. A table is usually hit by many different Selects. To find the optimal indexes, we need to see all the main queries at once.

If one Select begs for INDEX(a) and another begs for INDEX(a,b), having both indexes is counterproductive. It is better to get rid of the shorter one.

My recommendation above suggests either (year, cycle, amount) or (cycle, year, amount). Possibly another query would pick between them. Or, maybe there is enough variety in the queries to require both variations.

More on indexing: http://mysql.rjweb.org/doc.php/index_cookbook_mysql

  • Related