Optimize selecting all rows from a table based on results from the same table?-CodePudding

I'll be the first to admit that I'm not great at SQL (and I probably shouldn't be treating it like a rolling log file), but I was wondering if I could get some pointers for improving some slow queries...

I have a large mysql table with 2M rows where I do two full table lookups based on a subset of the most recent data. When I load the page that contains these queries, I often find they take several seconds to complete, but the queries inside are quite quick.

PMA's (supposedly terrible) advisor pretty much throws the entire kitchen sink at me, temporary tables, too many sorts, joins without indexes (I don't even have any joins?), reading from fixed position, reading next position, temporary tables written to disk... that last one especially makes me wonder if it's a configuration issue, but I played with all the knobs, and even paid for a managed service which didn't seem to help.

CREATE TABLE `archive` (
  `id` bigint UNSIGNED NOT NULL,
  `ip` varchar(15) CHARACTER SET utf8 COLLATE utf8_unicode_ci NOT NULL,
  `service` enum('ssh','telnet','ftp','pop3','imap','rdp','vnc','sql','http','smb','smtp','dns','sip','ldap') CHARACTER SET utf8 COLLATE utf8_unicode_ci NOT NULL,
  `hostid` bigint UNSIGNED NOT NULL,
  `date` datetime NOT NULL DEFAULT CURRENT_TIMESTAMP
) ENGINE=InnoDB DEFAULT CHARSET=utf8 COLLATE=utf8_unicode_ci;

ALTER TABLE `archive`
  ADD PRIMARY KEY (`id`),
  ADD KEY `service` (`service`),
  ADD KEY `date` (`date`),
  ADD KEY `ip` (`ip`),
  ADD KEY `date-ip` (`date`,`ip`),
  ADD KEY `date-service` (`date`,`service`),
  ADD KEY `ip-date` (`ip`,`date`),
  ADD KEY `ip-service` (`ip`,`service`),
  ADD KEY `service-date` (`service`,`date`),
  ADD KEY `service-ip` (`service`,`ip`);

Adding indexes definitely helped (even though they're 4x the size of the actual data), but I'm kindof at a loss where I can optimize further. Initially I thought about caching the subquery results in php and using it twice for the main queries, but I don't think I have access to the result once I close the subquery. I looked into doing joins, but they look like they're meant for 2 or more separate tables, but the subquery is from the same table, so I'm not sure if that would even work either. The queries are supposed to find the most active ip/services based on whether I have data from an ip in the past 24 hours...

SELECT service, COUNT(service) AS total FROM `archive`
WHERE ip IN
(SELECT DISTINCT ip FROM `archive` WHERE date > DATE_SUB(CURRENT_TIMESTAMP, INTERVAL 24 HOUR))
GROUP BY service HAVING total > 1
ORDER BY total DESC, service ASC LIMIT 10

 ---- -------------- ----------------- ------------ ------- ---------------------------------------------------------------------------- ------------ --------- ------------------------ ------- ---------- --------------------------------- 
| id | select_type  | table           | partitions | type  | possible_keys                                                              | key        | key_len | ref                    | rows  | filtered | Extra                           |
 ---- -------------- ----------------- ------------ ------- ---------------------------------------------------------------------------- ------------ --------- ------------------------ ------- ---------- --------------------------------- 
|  1 | SIMPLE       | <subquery2>     | NULL       | ALL   | NULL                                                                       | NULL       | NULL    | NULL                   |  NULL |   100.00 | Using temporary; Using filesort |
|  1 | SIMPLE       | archive         | NULL       | ref   | service,ip,date-service,ip-date,ip-service,service-date,service-ip         | ip-service | 47      | <subquery2>.ip         |     5 |   100.00 | Using index                     |
|  2 | MATERIALIZED | archive         | NULL       | range | date,ip,date-ip,date-service,ip-date,ip-service                            | date-ip    | 5       | NULL                   | 44246 |   100.00 | Using where; Using index        |
 ---- -------------- ----------------- ------------ ------- ---------------------------------------------------------------------------- ------------ --------- ------------------------ ------- ---------- ---------------------------------

SELECT ip, COUNT(ip) AS total FROM `archive`
WHERE ip IN
(SELECT DISTINCT ip FROM `archive` WHERE date > DATE_SUB(CURRENT_TIMESTAMP, INTERVAL 24 HOUR))
GROUP BY ip HAVING total > 1
ORDER BY total DESC, INET_ATON(ip) ASC LIMIT 10

 ---- -------------- ----------------- ------------ ------- --------------------------------------------------------------- --------- --------- ------------------------ ------- ---------- --------------------------------- 
| id | select_type  | table           | partitions | type  | possible_keys                                                 | key     | key_len | ref                    | rows  | filtered | Extra                           |
 ---- -------------- ----------------- ------------ ------- --------------------------------------------------------------- --------- --------- ------------------------ ------- ---------- --------------------------------- 
|  1 | SIMPLE       | <subquery2>     | NULL       | ALL   | NULL                                                          | NULL    | NULL    | NULL                   |  NULL |   100.00 | Using temporary; Using filesort |
|  1 | SIMPLE       | archive         | NULL       | ref   | ip,date-ip,ip-date,ip-service,service-ip                      | ip-date | 47      | <subquery2>.ip         |     5 |   100.00 | Using index                     |
|  2 | MATERIALIZED | archive         | NULL       | range | date,ip,date-ip,date-service,ip-date,ip-service               | date-ip | 5       | NULL                   | 44168 |   100.00 | Using where; Using index        |
 ---- -------------- ----------------- ------------ ------- --------------------------------------------------------------- --------- --------- ------------------------ ------- ---------- ---------------------------------

common subquery: 0.0351s

whole query 1: 1.4270s

whole query 2: 1.5601s

total page load: 3.050s (7 queries total)

Am I just doomed to terrible performance with this table?

Hopefully there's enough information here to get an idea of what's going, but if anyone can help I would certainly appreciate it. I don't mind throwing more hardware at the issue, but when an 8c/16t server with 16gb can't handle 150mb of data I'm not sure what will. Thanks in advance for reading my long winded question.

CodePudding user response：

You have the right indexes (as well as many other indexes) and your query both meets your specs and runs close to optimally. It's unlikely that you can make this much faster: it needs to look all the way back to the beginning of your table.

If you can change your spec so you only have to look back a limited amount of time like a year you'll get a good speedup.

Some possible minor tweaks.

use the latin1_bin collation for your ip and service columns. It uses 8-bit characters and collates them without any case sensitivity. That's plenty for IPv4 dotted-quad addresses (and IPv6 addresses). You'll get rid of a bit of overhead in matching and grouping. Or, even better,
If you know you will have nothing but IPv4 addresses, rework your ip column to store their binary representations ( that is, the INET_ATON() - generated value of each IPv4). You can fit those in the UNSIGNED INT 32-bit integer data type, making the lookup, grouping, and ordering even faster.

It's possible to rework the way you gather these data. For example, you could arrange to gather at most one row per service per day. That will reduce the timeseries resolution of your data, but it will also make your queries much faster. Define your table something like this:

CREATE TABLE archive2 (
  ip      VARCHAR(15) COLLATE latin1_bin NOT NULL,
  service ENUM ('ssh','telnet','ftp',
                'pop3','imap','rdp',
                'vnc','sql','http','smb',
                'smtp','dns','sip','ldap') COLLATE latin1_bin NOT NULL,
  `date`  DATE NOT NULL,
  `count` INT NOT NULL,
   hostid bigint UNSIGNED NOT NULL,
   PRIMARY KEY (`date`, ip, service)
) ENGINE=InnoDB;

Then, when you insert a row, use this query:

 INSERT INTO archive2 (`date`, ip, service, `count`, hostid)
               VALUES (CURDATE(), ?ip, ?service, 1, ?hostid)
ON DUPLICATE KEY UPDATE
              SET count = count   1;

This will automatically increment your count column if the row for the ip, service, and date already exists.

Then your second query will look like:

SELECT ip, SUM(`count`) AS total
  FROM archive 
 WHERE ip IN  (
           SELECT ip FROM archive 
            WHERE `date` > CURDATE() - INTERVAL 1 DAY
            GROUP BY ip
            HAVING total > 1
        )
ORDER BY total DESC, INET_ATON(ip) ASC LIMIT 10;

The index of the primary key will satisfy this query.

CodePudding user response：

First query

(I'm not convinced that it can be made much faster.)

(currently)

SELECT  service, COUNT(service) AS total
    FROM  `archive`
    WHERE  ip IN (
        SELECT  DISTINCT ip
            FROM  `archive`
            WHERE  date > DATE_SUB(CURRENT_TIMESTAMP, INTERVAL 24 HOUR)
                 )
    GROUP BY  service
    HAVING  total > 1
    ORDER BY  total DESC, service ASC
    LIMIT  10

Notes:

COUNT(service) --> COUNT(*)
DISTINCT is not needed in IN (SELECT DISTINCT ...)
IN ( SELECT ... ) is often slow; rewrite using EXISTS ( SELECT 1 ... ) or JOIN (see below)
INDEX(date, IP) -- for subquery
INDEX(service, IP) -- for your outer query
INDEX(IP, service) -- for my outer query
Toss redundant indexes; they can get in the way. (See below)
It will have to gather all the possible results before getting to the ORDER BY and LIMIT. (That is, LIMIT has very little impact on performance for this query.)
CHARACTER SET utf8 COLLATE utf8_unicode_ci is gross overkill for IP addresses; switch to CHARACTER SET ascii COLLATE ascii_bin.
If you are running MySQL 8.0 (Or MariaDB 10.2), a WITH to calculate the subquery once, together with a UNION to compute the two outer queries, may provide some extra speed.
MariaDB has a "subquery cache" that might have the effect of skipping the second subquery evaluation.
By using DATETIME instead of TIMESTAMP, you will two minor hiccups per year when daylight savings kicks in/out.
I doubt if hostid needs to be a BIGINT (8-bytes).

To switch to a JOIN, think of fetching the candidate rows first:

SELECT  service, COUNT(*) AS total
    FROM ( SELECT DISTINCT IP
             FROM archive
             WHERE `date` > NOW() - INTERVAL 24 HOUR
         ) AS x
    JOIN archive  USING(IP)
    GROUP BY service
    HAVING total > 1
    ORDER BY  total DESC, service ASC
    LIMIT  10

For any further discussion any slow (but working) query, please provide both flavors of EXPLAIN:

EXPLAIN SELECT ...
EXPLAIN FORMAT=JSON SELECT ...

Drop these indexes:

  ADD KEY `service` (`service`),
  ADD KEY `date` (`date`),
  ADD KEY `ip` (`ip`),

Recommend only

  ADD PRIMARY KEY (`id`),
  -- as discussed:
  ADD KEY `date-ip`      (`date`,`ip`),
  ADD KEY `ip-service`   (`ip`,`service`),
  ADD KEY `service-ip`   (`service`,`ip`),
  -- maybe other queries need these:
  ADD KEY `date-service` (`date`,`service`),
  ADD KEY `ip-date`      (`ip`,`date`),
  ADD KEY `service-date` (`service`,`date`),

The general rule here is that you don't need INDEX(a) when you also have INDEX(a,b). In particular, they may be preventing the use of better indexes; see the EXPLAINs.

Second query

Rewrite it

SELECT  ip, COUNT(DISTINCT ip) AS total
    FROM  `archive`
    WHERE  date > DATE_SUB(CURRENT_TIMESTAMP, INTERVAL 24 HOUR)
    GROUP BY  ip
    HAVING  total > 1
    ORDER BY  total DESC, INET_ATON(ip) ASC
    LIMIT  10

It will use only INDEX(date, ip).