What is the fastest way to count rows by date in mySQL?-CodePudding

I am trying to count the rows having created_at >= 2019-03-01.

created_at format is like 2018-03-18T14:37:35.000Z

I noticed that the table uses indexing as follows : but the created_at doesn't have an index :'( how can i use the current indexes to speed it up ?

{
    "records": [
        {
            "Table": "bugs",
            "Non_unique": 0,
            "Key_name": "PRIMARY",
            "Seq_in_index": 1,
            "Column_name": "id",
            "Collation": "A",
            "Cardinality": 9791826,
            "Sub_part": null,
            "Packed": null,
            "Null": "",
            "Index_type": "BTREE",
            "Comment": "",
            "Index_comment": ""
        },
        {
            "Table": "bugs",
            "Non_unique": 1,
            "Key_name": "index_bugs_on_category_and_token_and_reported_at",
            "Seq_in_index": 1,
            "Column_name": "category",
            "Collation": "A",
            "Cardinality": 1,
            "Sub_part": null,
            "Packed": null,
            "Null": "YES",
            "Index_type": "BTREE",
            "Comment": "",
            "Index_comment": ""
        },
        {
            "Table": "bugs",
            "Non_unique": 1,
            "Key_name": "index_bugs_on_category_and_token_and_reported_at",
            "Seq_in_index": 2,
            "Column_name": "token",
            "Collation": "A",
            "Cardinality": 29946,
            "Sub_part": null,
            "Packed": null,
            "Null": "YES",
            "Index_type": "BTREE",
            "Comment": "",
            "Index_comment": ""
        },
        {
            "Table": "bugs",
            "Non_unique": 1,
            "Key_name": "index_bugs_on_category_and_token_and_reported_at",
            "Seq_in_index": 3,
            "Column_name": "reported_at",
            "Collation": "A",
            "Cardinality": 6085027,
            "Sub_part": null,
            "Packed": null,
            "Null": "YES",
            "Index_type": "BTREE",
            "Comment": "",
            "Index_comment": ""
        }
    ]
}

I tried to do : SELECT count(id) FROM table WHERE created_at >= '20190301'; but it always gives server timeout.

it's worth mentioning that counting rows doesn't timeout the server : SELECT count(*) from bugs;

output :

    "records": [
        {
            "count(*)": 10049501
        }
     ]
   }

CodePudding user response：

If your question is why is the (*) does not time out the answer is this -- in order to count the (id) it has to validate that the id column does not contain null. null is not counted by the count statement. Since the table has 10 million rows it does not finish before the timeout.

CodePudding user response：

Indexing created_at is the most important thing you can do to optimize your query.

If created_at is not indexed, it will have to scan the whole table. If you run explain SELECT count(id) FROM table WHERE created_at >= '20190301' you will see something like this:

id  select_type     table   partitions  type    possible_keys   key     key_len     ref     rows    filtered    Extra
1   SIMPLE          test                ALL                                                 123     100.00      Using where

ALL meaning it must scan all 10 million rows.

With an index it can quickly find only those rows where created_at >= '20190301'.

id  select_type     table   partitions  type    possible_keys   key         key_len     ref     rows    filtered    Extra
1   SIMPLE          test                range   created_at      created_at  4                   1       100.00      Using where; Using index

However, if created_at >= '20190301' would be a large portion of the table it might decide scanning the whole table is faster than using the index.

The second thing you can do is to make sure created_at is a datetime type, not varchar. This will store the date as a number and the comparison will be faster. This will not improve performance nearly as much as an index will, but it will help. More importantly, it will make it much easier to work with the values and you can take advantage of date and time functions.