Having a table such as the following:
CREATE TABLE Associations (
  obj_id INT UNSIGNED NOT NULL,
  attr_id INT UNSIGNED NOT NULL,
  assignment DOUBLE NOT NULL,
  PRIMARY KEY (`obj_id`, `attr_id`)
);
this should occupy 16 bytes of column data per row (4 + 4 + 8). So the per-row footprint is small.
I need to use this as a lookup table, where the main query would be:
SELECT ... FROM Associations WHERE obj_id IN (... thousands and thousands of ids ...);
Taking these into account, along with the fact that the table will hold ~500 million rows, is there anything more to consider for good performance?
The table with this number of rows would occupy ~8GB (500 million rows × 16 bytes), which seems a reasonable size in general.
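For reference, the actual on-disk footprint can be checked once the data is loaded (the schema name below is a placeholder):

-- Data + index size in GB for the Associations table:
SELECT (DATA_LENGTH + INDEX_LENGTH) / POW(1024, 3) AS size_gb
FROM information_schema.TABLES
WHERE TABLE_SCHEMA = 'mydb'   -- placeholder schema name
  AND TABLE_NAME = 'Associations';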
Are there any further improvements to make here?
CodePudding user response:
Using IN() predicates with thousands of ids has some limitations. The possible outcome is that the optimizer gives up trying to analyze such a long list, and falls back to doing a table-scan of all 500 million rows.
You should read about range_optimizer_max_mem_size here: https://dev.mysql.com/doc/refman/8.0/en/range-optimization.html#range-optimization-memory-use
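For example, you can check the current limit and raise it for the session if the IN() list would exceed it (the 64MB value below is illustrative, not a recommendation):

-- Check the current limit (the MySQL 8.0 default is 8388608 bytes, i.e. 8MB):
SHOW VARIABLES LIKE 'range_optimizer_max_mem_size';
-- Raise it for this session only:
SET SESSION range_optimizer_max_mem_size = 64 * 1024 * 1024;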
Sometimes it's better to load the thousands of ids into a temporary table and do a join from that temp table to your Associations table. You should test it to see where the break-even point is, given your data and your version of MySQL.
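A minimal sketch of that approach, assuming a temp table named tmp_ids (the name and the sample values are illustrative):

-- Temp table holding the lookup ids, indexed so the join can use it:
CREATE TEMPORARY TABLE tmp_ids (
  obj_id INT UNSIGNED NOT NULL,
  PRIMARY KEY (obj_id)
);

-- Bulk-load the ids (multi-row INSERT shown; LOAD DATA INFILE also works):
INSERT INTO tmp_ids (obj_id) VALUES (101), (102), (103) /* , ... */;

-- Join instead of a giant IN() list:
SELECT a.obj_id, a.attr_id, a.assignment
FROM tmp_ids AS t
JOIN Associations AS a ON a.obj_id = t.obj_id;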
CodePudding user response:
Assuming you build another table with the obj_ids (see Bill's Answer), here is something else that will factor into the performance.
If innodb_buffer_pool_size is less than the data actually needed to satisfy the query, there could be extra I/O thrashing.
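You can compare the current setting against the working set (reading it is harmless; resizing is a separate decision):

-- Current buffer pool size, in bytes:
SHOW VARIABLES LIKE 'innodb_buffer_pool_size';
-- Buffer pool activity (pages read from disk vs. served from memory):
SHOW ENGINE INNODB STATUS;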
If the obj_ids are "adjacent" values and there are not many rows for each obj_id, then multiple results could be found in each block. This would decrease the I/O and the pressure on the cache (buffer pool). An InnoDB block is 16KB. As a rule of thumb, a block holds 100 rows; with your ~16-byte rows it is probably around 400 (16,384 bytes divided by roughly 40 bytes per row, counting InnoDB's per-row overhead).
So, if there are typically more than 400 rows per obj_id, each obj_id would lead to one or more blocks being read.
If, on the other hand, the obj_id values are clustered and there are only a few rows for each, then one block can provide multiple results.
What will you do with "thousands and thousands" of rows? Maybe some form of chunking would be beneficial -- let's see what happens next.
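If chunking does help, one SQL-side way is to walk the id list in batches via the temp table from Bill's answer (the tmp_ids name, the @last_id bookkeeping, and the batch size of 1000 ids are all illustrative assumptions):

-- Start before the smallest id; after each batch, the client advances
-- @last_id to the largest obj_id returned:
SET @last_id = 0;

-- Fetch all rows for the next 1000 obj_ids:
SELECT a.obj_id, a.attr_id, a.assignment
FROM (
  SELECT obj_id FROM tmp_ids
  WHERE obj_id > @last_id
  ORDER BY obj_id
  LIMIT 1000
) AS batch
JOIN Associations AS a ON a.obj_id = batch.obj_id;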