Home > front end >  Is an auto inc primary key needed here to avoid fragmentation?
Is an auto inc primary key needed here to avoid fragmentation?

Time:11-08

I have a table such as follows:

CREATE TABLE Associations (
  obj_id int unsigned NOT NULL,
  attr_id int unsigned NOT NULL,
  assignment Double NOT NULL
  PRIMARY KEY (`obj_id`, `attr_id`),
);

Now the insertion order for the rows is/will be random. Would such a definition lead to fragmentation of the table? Should I be adding an auto inc primary key or would that only speed up the insert and would not help the speed of SELECT queries?
What would a better table definition be for random inserts?

Note, that performance wise I am more interested in SELECT than INSERT

CodePudding user response:

(Assuming you are using ENGINE=InnoDB.)

Short answer: Do not fret about fragmentation.

Long answer:

There are two types of "fragmentation" -- Which one bothers you?

  • BTree blocks becoming less than full.

  • Blocks becoming scattered around the disk.

  • If you have an SSD disk, the scattering of blocks around the disk has no impact on performance. For HDD, it matters some, but still not enough to get very worried about.

  • Fragmentation does not "run away". If two adjacent blocks are seen to be relatively empty, they are combined. Result: The "average" block is about 69% full.

  • In your particular example, when you want multiple "attributes" for one "object", they will be found "clustered". That is they will be mostly in the same block, hence a bit faster to access. Adding id AUTO_INCREMENT PRIMARY KEY would slow SELECTs/UPDATEs down.

  • Another reason why an id would slow down SELECTs is that SELECT * FROM t WHERE obj_id=... needs to first find the item in the index, then reach into the data for the other columns. With PRIMARY KEY(obj_id, ...), there is no need for this extra hop. (In some situations, this is a big speedup.)

  • OPTIMIZE TABLE takes time and blocks access while you are running it.

  • Even after OPTIMIZE, fragmentation comes back -- for a variety of reasons.

  • "Fill factor" is virtually useless -- UPDATE and DELETE store extra copies of rows pending COMMIT. This leads to block splits (aka page splits) if fill_factor is too high or sparse blocks if too low. That is, it is too hard to be worth trying to tune.

  • Fewer indexes means less disk space, etc. You probably need an index on (obj_id, attr_id) whether or not you also have (id). So, why waste space when it does not help?

The one case where OPTIMIZE TABLE can make a noticeable difference is after you delete lots of rows. I discuss several ways to avoid this issue here: http://mysql.rjweb.org/doc.php/deletebig

CodePudding user response:

I guess you use the InnoDB access method. InnoDB stores its data in a so-called clustered index. That is, all the data is stashed away behind the BTREE primary key.

Read this for background.

When you insert a row, you're inserting it into the BTREE structure. To oversimplify, BTREEs are made up of elaborately linked pages accessible in order. That means your data goes into some page somewhere. When you insert data in primary-key order, the data goes into a page at the end of the BTREE. So, when a page fills up, InnoDB just makes another one and puts your data there.

But, when you insert in some other order, often your row must go between other rows in an existing BTREE page. If the page has enough free space, InnoDB can drop your data into it. But, if the page does not have enough space, InnoDB must do a page split. It makes two pages from one, and puts your new row into one of the two.

Doing inserts in some order other than index order causes more page splits. That's why it doesn't perform as well. The classic example is building a table with a UUIDv4 (random) primary key column.

Now, you asked about autoincrementing primary keys. If you have such a key in your InnoDB table, all (or almost all) your INSERTs go into the last page of the clustered index, so you don't get the page split overhead. Cool.

But, if you need an index on some other column or columns that aren't in your INSERT order, you'll get page splits in that secondary index. The entries in secondary indexes are often smaller than the ones in clustered indexes, so you get fewer page splits. But you still get them.

Some DBMSs, but not MySQL, let you declare FILL_PERCENT(50) or something in both clustered and secondary indexes. That's useful for out-of-order loads because your can make your pages start out with less space already used, so you get fewer page splits. (Of course, you use more RAM and SSD with lower fill factors.)

MySQL doesn't have FILL_FACTOR in its data definition language. It does have a global systemwide variable called innodb_fill_factor. It is a percentage number. Its default is 100, which actually means 1/16th of each page is left unused.

If you know you have to do a big out-of-index-order bulk load you can give this command first to leave 60% of each new page available, to reduce page splits.

SET GLOBAL innodb_fill_factor = 40;

But beware, this is a system-wide setting. It will apply to everything on your MySQL server. You might want to put it back when done to save RAM and SSD space in production.

Finally, OPTIMIZE TABLE tablename; can reorganize tables that have had a lot of page splits to clean them up. (In InnoDB the OPTIMIZE command actually maps to ALTER TABLE tablename FORCE; ANALYZE TABLE tablename;.) It can take a while, so beware.

When you OPTIMIZE, InnoDB remakes the pages to bring their fill percentages near to the number you set in the system variable.

Unless you're doing a really vast bulk load on a vast table, my advice is to not worry about all this fill percentage business. Design your table to match your application and don't look back.

When you're done with any bulk load you can, if you want, do OPTIMIZE TABLE to get rid of any gnarly page splits.

Edit Your choice of primary key is perfect for your queries' WHERE pattern obj_id IN (val, val, val). Don't change that primary key, especially not to an autoincrementing one.

Pro tip It's tempting to try to forsee scaling problems in the early days of an app's lifetime. And there's no harm in it. But in the case of SQL databases, it's really hard to forsee the actual query patterns that will emerge as your app scales up. Fortunately, SQL's designed so you can add and tweak indexes as you go. You don't have to achieve performance perfection on day 1. So, my advice: think about this issue, but avoid overthinking it. With respect, you're starting to overthink it.

  • Related