Having a table such as the following:
CREATE TABLE Associations (
  obj_id INT UNSIGNED NOT NULL,
  attr_id INT UNSIGNED NOT NULL,
  assignment DOUBLE NOT NULL,
  PRIMARY KEY (`obj_id`, `attr_id`)
);
this should occupy 16 bytes of column data per row (4 + 4 + 8). So the per-row footprint is small.
I need to use this as a lookup table, where the main query would be:
SELECT ... FROM Associations WHERE obj_id IN (... thousands and thousands of ids ...);
Taking these into account, along with the fact that the table will hold ~500 million rows, is there anything more to consider for good performance?
The table with this number of rows would occupy ~8GB (500 million rows × 16 bytes), which seems a reasonable size in general.
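For reference, the actual on-disk footprint can be checked once the data is loaded (the schema name below is a placeholder):

-- Data + index size in GB for the Associations table:
SELECT (DATA_LENGTH + INDEX_LENGTH) / POW(1024, 3) AS size_gb
FROM information_schema.TABLES
WHERE TABLE_SCHEMA = 'mydb'   -- placeholder schema name
  AND TABLE_NAME = 'Associations';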
Are there any further improvements to make here?
CodePudding user response:
Using IN() predicates with thousands of ids has some limitations. The possible outcome is that the optimizer gives up trying to analyze such a long list, and falls back to doing a table-scan of all 500 million rows.
You should read about range_optimizer_max_mem_size here: https://dev.mysql.com/doc/refman/8.0/en/range-optimization.html#range-optimization-memory-use
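For example, you can check the current limit and raise it for the session if the IN() list would exceed it (the 64MB value below is illustrative, not a recommendation):

-- Check the current limit (the MySQL 8.0 default is 8388608 bytes, i.e. 8MB):
SHOW VARIABLES LIKE 'range_optimizer_max_mem_size';
-- Raise it for this session only:
SET SESSION range_optimizer_max_mem_size = 64 * 1024 * 1024;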
Sometimes it's better to load the thousands of ids into a temporary table and do a join from that temp table to your Associations table. You should test it to see where the break-even point is, given your data and your version of MySQL.
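A minimal sketch of that approach, assuming a temp table named tmp_ids (the name and the sample values are illustrative):

-- Temp table holding the lookup ids, indexed so the join can use it:
CREATE TEMPORARY TABLE tmp_ids (
  obj_id INT UNSIGNED NOT NULL,
  PRIMARY KEY (obj_id)
);

-- Bulk-load the ids (multi-row INSERT shown; LOAD DATA INFILE also works):
INSERT INTO tmp_ids (obj_id) VALUES (101), (102), (103) /* , ... */;

-- Join instead of a giant IN() list:
SELECT a.obj_id, a.attr_id, a.assignment
FROM tmp_ids AS t
JOIN Associations AS a ON a.obj_id = t.obj_id;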
CodePudding user response:
Assuming you build another table with the obj_ids (see Bill's Answer), here is something else that will factor into the performance.
If innodb_buffer_pool_size is less than the data actually needed to satisfy the query, there could be extra I/O thrashing.
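You can compare the current setting against the working set (reading it is harmless; resizing is a separate decision):

-- Current buffer pool size, in bytes:
SHOW VARIABLES LIKE 'innodb_buffer_pool_size';
-- Buffer pool activity (pages read from disk vs. served from memory):
SHOW ENGINE INNODB STATUS;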
If the obj_ids are "adjacent" values and there are not many rows for each obj_id, then multiple results could be found in each block. This would decrease the I/O and the pressure on the cache (buffer pool). An InnoDB block is 16KB. As a rule of thumb, a block holds 100 rows; with your ~16-byte rows it is probably around 400 (16,384 bytes divided by roughly 40 bytes per row, counting InnoDB's per-row overhead).
So, if there are typically more than 400 rows per obj_id, each obj_id would lead to one or more blocks being read.
If, on the other hand, the obj_id values are clustered and there are only a few rows for each, then one block can provide multiple results.
What will you do with "thousands and thousands" of rows? Maybe some form of chunking would be beneficial -- let's see what happens next.
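If chunking does help, one SQL-side way is to walk the id list in batches via the temp table from Bill's answer (the tmp_ids name, the @last_id bookkeeping, and the batch size of 1000 ids are all illustrative assumptions):

-- Start before the smallest id; after each batch, the client advances
-- @last_id to the largest obj_id returned:
SET @last_id = 0;

-- Fetch all rows for the next 1000 obj_ids:
SELECT a.obj_id, a.attr_id, a.assignment
FROM (
  SELECT obj_id FROM tmp_ids
  WHERE obj_id > @last_id
  ORDER BY obj_id
  LIMIT 1000
) AS batch
JOIN Associations AS a ON a.obj_id = batch.obj_id;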