Home > OS >  SQL MAX on primary key, is filter condition unncessary if it is already indexed?
SQL MAX on primary key, is filter condition unncessary if it is already indexed?

Time:11-06

select MAX(id) from studenthistory 
where class_id = 1 
   and date(created_at) = '2021-11-05' 
   and time(created_at) > TIME('04:00:00') 
group by student_id

composite indexes = ("class_id", "student_id", "created_at")

id is the primary key.

Is date(created_at) = '2021-11-05' and time(created_at) > TIME('04:00:00') filter condition unnecessary for Max function since studenthistory is already indexed on class_id and student_id?

The only reason I added that datetime filter is because this table will get huge over time. (historical data) And I wanted to reduce the number of rows the query has to search.

But for the case of Max function - I believe MAX would simply fetch the last value without checking the whole row, if it is indexed.

So can i safely remove the datetime filter and turn it into

select MAX(id) from studenthistory 
where class_id = 1 
group by student_id

And have the same performance? (or better since it does not need to filter further?)

Checking the query plan seems like the performance is similar, but the size of the table is rather small as of now..

First:

| -> Group aggregate: max(id)  (cost=1466.30 rows=7254) (actual time=2.555..5.766 rows=3 loops=1)
    -> Filter: ((cast(studenthistory.created_at as date) = '2021-11-05') and (cast(riderlocation.created_at as time(6)) > <cache>(cast('04:00:00' as time))))  (cost=740.90 rows=7254) (actual time=0.762..5.384 rows=5349 loops=1)
        -> Index lookup on studenthistory using idx_studenthistory_class_id_931474 (class_id=1)  (cost=740.90 rows=7254) (actual time=0.029..3.589 rows=14638 loops=1)
 |

1 row in set (0.00 sec)

Second:

| -> Group aggregate: max(studenthistory.id)  (cost=1475.40 rows=7299) (actual time=0.545..5.271 rows=10 loops=1)
    -> Index lookup on studenthistory using idx_studenthistory_class_id_931474 (class_id=1)  (cost=745.50 rows=7299) (actual time=0.026..4.164 rows=14729 loops=1)
 |
1 row in set (0.01 sec)

Many thanks in advance


UPDATE: applying @rick james's suggestion:

Changed index to (class_id, student_id, id).

FLUSH STATUS;
explain FORMAT=JSON SELECT MAX(`id`) `0` FROM `studenthistory` WHERE `class_id`=1 AND `created_at`>='2021-11-05T18:25:50.544850 00:00' GROUP BY `student_id`;


| {
  "query_block": {
    "select_id": 1,
    "cost_info": {
      "query_cost": "940.10"
    },
    "grouping_operation": {
      "using_filesort": false,
      "table": {
        "table_name": "studenthistory",
        "access_type": "ref",
        "possible_keys": [
          "fk_studenthist_student_e25b0310",
          "idx_studenthistory_class_id_931474"
        ],
        "key": "idx_studenthistory_class_id_931474",
        "used_key_parts": [
          "class_id"
        ],
        "key_length": "4",
        "ref": [
          "const"
        ],
        "rows_examined_per_scan": 8381,
        "rows_produced_per_join": 2793,
        "filtered": "33.33",
        "cost_info": {
          "read_cost": "102.00",
          "eval_cost": "279.34",
          "prefix_cost": "940.10",
          "data_read_per_join": "130K"
        },
        "used_columns": [
          "id",
          "created_at",
          "student_id",
          "class_id"
        ],
        "attached_condition": "(`test-table`.`studenthistory`.`created_at` >= TIMESTAMP'2021-11-05 18:25:50.54485')"
      }
    }
  }
} |

i.e. only class_id is used as an index, (as created_at is no longer in the index. rows_produced_per_join is lower due to filter: 2793,

Without datetime filter:

FLUSH STATUS;
mysql> explain FORMAT=JSON SELECT MAX(`id`) `0` FROM `riderlocation` WHERE `zone_id`=1  GROUP BY `rider_id`;


| {
  "query_block": {
    "select_id": 1,
    "cost_info": {
      "query_cost": "854.75"
    },
    "grouping_operation": {
      "using_filesort": false,
      "table": {
        "table_name": "studenthistory",
        "access_type": "ref",
        "possible_keys": [
          "fk_studenthistory_student_e25b0310",
          "idx_studenthistory_class_id_931474"
        ],
        "key": "idx_studenthistory_class_id_931474",
        "used_key_parts": [
          "class_id"
        ],
        "key_length": "4",
        "ref": [
          "const"
        ],
        "rows_examined_per_scan": 8381,
        "rows_produced_per_join": 8381,
        "filtered": "100.00",
        "using_index": true,
        "cost_info": {
          "read_cost": "16.65",
          "eval_cost": "838.10",
          "prefix_cost": "854.75",
          "data_read_per_join": "392K"
        },
        "used_columns": [
          "id",
          "student_id",
          "class_id"
        ]
      }
    }
  }
} |

Runs on all 3 indexes ("class_id", "student_id", "id"), same 8381 number of rows slightly lower query cost (940 -> 854)

Applying the first query with original index ("class_id", "student_id", "created_at") yields:

FLUSH STATUS;
    explain FORMAT=JSON SELECT MAX(`id`) `0` FROM `studenthistory` WHERE `class_id`=1 AND `created_at`>='2021-11-05T18:25:50.544850 00:00' GROUP BY `student_id`;

| {
  "query_block": {
    "select_id": 1,
    "cost_info": {
      "query_cost": "858.94"
    },
    "grouping_operation": {
      "using_filesort": false,
      "table": {
        "table_name": "studenthistory",
        "access_type": "ref",
        "possible_keys": [
          "fk_studenthistory_student_e25b0310",
          "idx_studenthistory_class_id_931474"
        ],
        "key": "idx_studenthistory_class_id_931474",
        "used_key_parts": [
          "zone_id"
        ],
        "key_length": "4",
        "ref": [
          "const"
        ],
        "rows_examined_per_scan": 8381,
        "rows_produced_per_join": 2793,
        "filtered": "33.33",
        "using_index": true,
        "cost_info": {
          "read_cost": "20.84",
          "eval_cost": "279.34",
          "prefix_cost": "858.94",
          "data_read_per_join": "130K"
        },
        "used_columns": [
          "id",
          "created_at",
          "student_id",
          "class_id"
        ],
        "attached_condition": "(`test-table`.`studenthistory`.`created_at` >= TIMESTAMP'2021-11-05 18:25:50.54485')"
      }
    }
  }
} |

The cost this time is 858, rows "rows_examined_per_scan": 8381, "rows_produced_per_join": 2793. Only class_id was used as key however. (why.?) not the remaining student_id and created_at

CodePudding user response:

Query 1

select MAX(id) from studenthistory 
where class_id = 1 
   and date(created_at) = '2021-11-05' 
   and time(created_at) > TIME('04:00:00') 
group by student_id

Don't split up the date; change to

AND created_at > '2021-11-05 04:00:00'

If you want to check rows that were 'created' on the day, use something

AND created_at >= '2021-11-05'
AND created_at  < '2021-11-05'   INTERVAL 1 DAY

Or, if you want to check for "today":

AND created_at >= CURDATE()

After 4am this morning:

AND created_at >= CURDATE()   INTERVAL 4 HOUR

Using date(created_at) makes the created_at part of the INDEX unusable. (cf "sargable")

select MAX(id) ... group by student_id

Is likely to return multiple rows -- one per student. Perhaps you want to get rid of the group by? Or specify a particular student_id?

Query 2 may run faster:

select MAX(id) from studenthistory 
where class_id = 1 
group by student_id

But the optimal index is INDEX(class_id, student_id, id), (It is OK to include both composite indexes.)

It may return multiple rows, so perhaps you want

select student_id, MAX(id) from studenthistory 
where class_id = 1 
group by student_id

MAX

I believe MAX would simply fetch the last value without checking the whole row, if it is indexed.

Sometimes.

Your second query can do that. But the first query cannot -- because of the range test (on created_at) being in the way.

EXPLAIN

query plan seems ... similar

Alas, EXPLAIN leaves details out. You can get some more details with EXPLAIN FORMAT=JSON SELECT ..., but not necessarily enough details.

I think you will find that the second query will give a much smaller value for "Rows" after adding my suggested index.

A way to get an accurate measure of "rows (table or index) touched":

FLUSH STATUS;
SELECT ...;
SHOW SESSION STATUS LIKE 'Handler%';

CodePudding user response:

The query:

select MAX(id) from studenthistory 
where class_id = 1 
group by student_id

Can be fast if you create the index:

create index ix1 on studenthistory (class_id, student_id, id);
  • Related