Home > database >  mysql index selection on large table
mysql index selection on large table

Time:12-31

I have a couple of tables that looks like this:

CREATE TABLE Entities (
   id INT NOT NULL AUTO_INCREMENT,
   name VARCHAR(45) NOT NULL,
   client_id INT NOT NULL,
   display_name VARCHAR(45),
   PRIMARY KEY (id)
)

CREATE TABLE Statuses (
   id INT NOT NULL AUTO_INCREMENT,
   name VARCHAR(45) NOT NULL,
   PRIMARY KEY (id)
)

CREATE TABLE EventTypes (
   id INT NOT NULL AUTO_INCREMENT,
   name VARCHAR(45) NOT NULL,
   PRIMARY KEY (id)
)

CREATE TABLE Events (
   id INT NOT NULL AUTO_INCREMENT,
   entity_id INT NOT NULL,
   date DATE NOT NULL,
   event_type_id INT NOT NULL,
   status_id INT NOT NULL
)

Events is large > 100,000,000 rows

Entities, Statuses and EventTypes are small < 300 rows a piece

I have several indexes on Events, but the ones that come into play are

idx_events_date_ent_status_type (date, entity_id, status_id, event_type_id)
idx_events_date_ent_status_type (entity_id, status_id, event_type_id)
idx_events_date_ent_type (date, entity_id, event_type_id)

I have a large complicated query, but I'm getting the same slow query results with a simpler one like the one below (note, in the real queries, I don't use evt.*)

SELECT evt.*, ent.name AS ent_name, s.name AS stat_name, et.name AS type_name
FROM `Events` evt
   JOIN `Entities` ent ON evt.entity_id = ent.id
   JOIN `EventTypes` et ON evt.event_type_id = et.id
   JOIN `Statuses` s ON evt.status_id = s.id
WHERE
   evt.date BETWEEN @start_date AND @end_date AND
   evt.entity_id IN ( 19 ) AND -- this in clause is built by code
   evt.event_type_id = @type_id

For some reason, mysql keeps choosing the index which doesn't cover Events.date and the query takes 15 seconds or more and returns a couple thousand rows. If I change the query to:

SELECT evt.*, ent.name AS ent_name, s.name AS stat_name, et.name AS type_name
FROM `Events` evt force index (idx_events_date_ent_status_type)
   JOIN `Entities` ent ON evt.entity_id = ent.id
   JOIN `EventTypes` et ON evt.event_type_id = et.id
   JOIN `Statuses` s ON evt.status_id = s.id
WHERE
   evt.date BETWEEN @start_date AND @end_date AND
   evt.entity_id IN ( 19 ) AND -- this in clause is built by code
   evt.event_type_id = @type_id

The query takes .014 seconds.

Since this query is built by code, I would much rather not force the index, but mostly, I want to know why it chooses one index over the other. Is it because of the joins?

To give some stats, there are ~2500 distinct dates, and ~200 entities in the Events table. So I suppose that might be why it chooses the index with all of the low cardinality columns.

Do you think it would help to add date to the end of idx_events_date_ent_status_type? Since this is a large table, it takes a long time to add indexes.

I tried adding an additional index, ix_events_ent_date_status_et(entity_id, date, status_id, event_type_id) and it actually made the queries slower.

I will experiment a bit more, but I feel like I'm not sure how the optimizer makes it's decisions.

Additional Info:

I tried removing the join to the Statuses table, and mysql switches to ix_events_date_ent_type, and the query runs in 0.045 sec

I can't wrap my head around why removing a join to a table that is not part of the filter impacts the choice of index.

CodePudding user response:

I would add this index:

ALTER TABLE Events ADD INDEX (event_type_id, entity_id, date);

The order of columns is important. Put all column(s) used in equality conditions first. This is event_type_id in this case.

The optimizer can use multiple columns to optimize equalities, if the columns are left-most and consecutive.

Then the optimizer can use one more column to optimize a range condition. A range condition is anything other than = or IS NULL. So range conditions include >, !=, BETWEEN, IN(), LIKE (with no leading wildcard), IS NOT NULL, and so on.

The condition on entity_id is also an equality condition if the IN() list has one element. MySQL's optimizer can treat a list of one value as an equality condition. But if the list has more than one value, it becomes a range condition. So if the example you showed of IN (19) is typical, then all three columns of the index will be used for filtering.

It's still worth putting date in the index, because it can at least tell the InnoDB storage engine to filter rows before returning them. See https://dev.mysql.com/doc/refman/8.0/en/index-condition-pushdown-optimization.html It's not quite as good as a real index lookup, but it's worthwhile.

I would also suggest creating a smaller table to test with. Doing experiments on a 100 million row table is time-consuming. But you do need a table with a non-trivial amount of data, because if you test on an empty table, the optimizer behaves differently.

CodePudding user response:

To most effectively use a composite index on multiple values of two different fields, you need to specify the values with joins instead of simple where conditions. So assuming you are selecting dates from 2022-12-01 to 2022-12-03 and entity_id in (1,2,3), do:

select ...
from (select date('2022-12-01') date union all select date('2022-12-02') union all select date('2022-12-03')) dates
join Entities on Entities.id in (1,2,3)
join Events on Events.entity_id=Entities.id and Events.date=dates.date

If you pre-create a dates table with all dates from 0000-01-01 to 9999-12-31, then you can do:

select ...
from dates
join Entities on Entities.id in (1,2,3)
join Events on Events.entity_id=Entities.id and Events.date=dates.date
where dates.date between @start_date and @end_date
  • Related