Number of rows vs size vs query pattern for efficient access


Having a table such as follows:

CREATE TABLE Associations (
  obj_id int unsigned NOT NULL,
  attr_id int unsigned NOT NULL,
  assignment double NOT NULL,
  PRIMARY KEY (obj_id, attr_id)
);

the column data works out to 16 bytes per row (4 + 4 + 8), so the per-row data footprint is small.
I need to use this as a lookup table where the main query would be:

SELECT ... FROM Associations WHERE obj_id IN (... thousands and thousands of ids ...);

Taking these into account, along with the fact that the table will hold ~500 million rows, is there anything more to consider for good performance? At 16 bytes per row, the raw column data for that many rows is ~8 GB (the actual InnoDB footprint will be somewhat larger), which seems a reasonable size in general.
Are there any further improvements to make here?
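
(For reference, once the data is loaded, the actual on-disk footprint can be sanity-checked from information_schema; a quick sketch, assuming the table definition above:)

SELECT table_name,
       ROUND((data_length + index_length) / 1024 / 1024 / 1024, 2) AS approx_gb
FROM information_schema.tables
WHERE table_name = 'Associations';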

CodePudding user response:

Using IN() predicates with thousands of ids has some limitations. One possible outcome is that the optimizer gives up trying to analyze such a long list and falls back to doing a table-scan of all 500 million rows.
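
If you want to see which plan the optimizer actually picked, EXPLAIN the query with a representative id list (a sketch; the select list and ids are illustrative):

EXPLAIN
SELECT obj_id, attr_id, assignment
FROM Associations
WHERE obj_id IN (1, 2, 3 /* ...thousands more... */);
-- type = "range" with key = PRIMARY means the index is being used;
-- type = "ALL" means the optimizer fell back to a full table scan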

You should read about range_optimizer_max_mem_size here: https://dev.mysql.com/doc/refman/8.0/en/range-optimization.html#range-optimization-memory-use
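
The variable can be inspected and raised per session if your id list turns out to exceed the default limit (a sketch; the value shown is illustrative, not a recommendation):

SELECT @@range_optimizer_max_mem_size;                  -- 8388608 (8 MB) by default in recent versions
SET SESSION range_optimizer_max_mem_size = 33554432;    -- e.g. 32 MB for this session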

Sometimes it's better to load the thousands of ids into a temporary table and do a join from that temp table to your Associations table, as sketched below. You should test it to see where the breakpoint is, given your data and your version of MySQL.
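
A minimal sketch of the temp-table approach (the temp table name is made up; load the real ids with batched multi-row INSERTs or LOAD DATA):

CREATE TEMPORARY TABLE lookup_ids (
  obj_id int unsigned NOT NULL,
  PRIMARY KEY (obj_id)
) ENGINE=InnoDB;

INSERT INTO lookup_ids (obj_id) VALUES (1), (2), (3);   -- ...and so on, in batches

SELECT a.obj_id, a.attr_id, a.assignment
FROM lookup_ids AS l
JOIN Associations AS a ON a.obj_id = l.obj_id;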

CodePudding user response:

Assuming you build another table with the obj_ids (see Bill's Answer), here is something else that will factor into the performance.

If innodb_buffer_pool_size is less than the data actually needed to satisfy the query, there could be extra I/O thrashing.
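
The current setting can be checked, and in MySQL 5.7.5+ / 8.0 it can be resized online; a sketch, with an illustrative size:

SELECT @@innodb_buffer_pool_size;                              -- current size in bytes
SET GLOBAL innodb_buffer_pool_size = 8 * 1024 * 1024 * 1024;   -- e.g. 8 GB, if the server has the RAM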

If the obj_ids are "adjacent" values and there are not many rows for each obj_id, then multiple results can be found in each block. This could decrease the I/O and the pressure on the cache (buffer_pool). A block is 16KB. As a rule of thumb, a block holds 100 rows; with rows as narrow as yours, it is probably around 400 per block (16KB divided by roughly 40 bytes per row, once InnoDB's per-row overhead is added to the 16 bytes of data).

So, if there are typically more than 400 rows per obj_id, each obj_id would lead to one or more blocks being read on its own.

If, on the other hand, the obj_id values are clustered and there are only a few rows for each, then one block can provide multiple results.

What will you do with "thousands and thousands" of rows? Maybe some form of chunking would be beneficial -- let's see what happens next.
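
For example, instead of one query carrying the full list, the application could issue one query per chunk of ids and merge the results (a sketch; the ids and chunk size are illustrative):

SELECT obj_id, attr_id, assignment
FROM Associations
WHERE obj_id IN (101, 102, 103 /* ...one chunk of perhaps a few thousand ids... */);
-- repeat with the next chunk until the full list is covered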
