I'll be the first to admit that I'm not great at SQL (and I probably shouldn't be treating it like a rolling log file), but I was wondering if I could get some pointers for improving some slow queries...
I have a large mysql table with 2M rows where I do two full table lookups based on a subset of the most recent data. When I load the page that contains these queries, I often find they take several seconds to complete, but the queries inside are quite quick.
PMA's (supposedly terrible) advisor pretty much throws the entire kitchen sink at me, temporary tables, too many sorts, joins without indexes (I don't even have any joins?), reading from fixed position, reading next position, temporary tables written to disk... that last one especially makes me wonder if it's a configuration issue, but I played with all the knobs, and even paid for a managed service which didn't seem to help.
CREATE TABLE `archive` (
`id` bigint UNSIGNED NOT NULL,
`ip` varchar(15) CHARACTER SET utf8 COLLATE utf8_unicode_ci NOT NULL,
`service` enum('ssh','telnet','ftp','pop3','imap','rdp','vnc','sql','http','smb','smtp','dns','sip','ldap') CHARACTER SET utf8 COLLATE utf8_unicode_ci NOT NULL,
`hostid` bigint UNSIGNED NOT NULL,
`date` datetime NOT NULL DEFAULT CURRENT_TIMESTAMP
) ENGINE=InnoDB DEFAULT CHARSET=utf8 COLLATE=utf8_unicode_ci;
ALTER TABLE `archive`
ADD PRIMARY KEY (`id`),
ADD KEY `service` (`service`),
ADD KEY `date` (`date`),
ADD KEY `ip` (`ip`),
ADD KEY `date-ip` (`date`,`ip`),
ADD KEY `date-service` (`date`,`service`),
ADD KEY `ip-date` (`ip`,`date`),
ADD KEY `ip-service` (`ip`,`service`),
ADD KEY `service-date` (`service`,`date`),
ADD KEY `service-ip` (`service`,`ip`);
Adding indexes definitely helped (even though they're 4x the size of the actual data), but I'm kindof at a loss where I can optimize further. Initially I thought about caching the subquery results in php and using it twice for the main queries, but I don't think I have access to the result once I close the subquery. I looked into doing joins, but they look like they're meant for 2 or more separate tables, but the subquery is from the same table, so I'm not sure if that would even work either. The queries are supposed to find the most active ip/services based on whether I have data from an ip in the past 24 hours...
SELECT service, COUNT(service) AS total FROM `archive`
WHERE ip IN
(SELECT DISTINCT ip FROM `archive` WHERE date > DATE_SUB(CURRENT_TIMESTAMP, INTERVAL 24 HOUR))
GROUP BY service HAVING total > 1
ORDER BY total DESC, service ASC LIMIT 10
---- -------------- ----------------- ------------ ------- ---------------------------------------------------------------------------- ------------ --------- ------------------------ ------- ---------- ---------------------------------
| id | select_type | table | partitions | type | possible_keys | key | key_len | ref | rows | filtered | Extra |
---- -------------- ----------------- ------------ ------- ---------------------------------------------------------------------------- ------------ --------- ------------------------ ------- ---------- ---------------------------------
| 1 | SIMPLE | <subquery2> | NULL | ALL | NULL | NULL | NULL | NULL | NULL | 100.00 | Using temporary; Using filesort |
| 1 | SIMPLE | archive | NULL | ref | service,ip,date-service,ip-date,ip-service,service-date,service-ip | ip-service | 47 | <subquery2>.ip | 5 | 100.00 | Using index |
| 2 | MATERIALIZED | archive | NULL | range | date,ip,date-ip,date-service,ip-date,ip-service | date-ip | 5 | NULL | 44246 | 100.00 | Using where; Using index |
---- -------------- ----------------- ------------ ------- ---------------------------------------------------------------------------- ------------ --------- ------------------------ ------- ---------- ---------------------------------
SELECT ip, COUNT(ip) AS total FROM `archive`
WHERE ip IN
(SELECT DISTINCT ip FROM `archive` WHERE date > DATE_SUB(CURRENT_TIMESTAMP, INTERVAL 24 HOUR))
GROUP BY ip HAVING total > 1
ORDER BY total DESC, INET_ATON(ip) ASC LIMIT 10
---- -------------- ----------------- ------------ ------- --------------------------------------------------------------- --------- --------- ------------------------ ------- ---------- ---------------------------------
| id | select_type | table | partitions | type | possible_keys | key | key_len | ref | rows | filtered | Extra |
---- -------------- ----------------- ------------ ------- --------------------------------------------------------------- --------- --------- ------------------------ ------- ---------- ---------------------------------
| 1 | SIMPLE | <subquery2> | NULL | ALL | NULL | NULL | NULL | NULL | NULL | 100.00 | Using temporary; Using filesort |
| 1 | SIMPLE | archive | NULL | ref | ip,date-ip,ip-date,ip-service,service-ip | ip-date | 47 | <subquery2>.ip | 5 | 100.00 | Using index |
| 2 | MATERIALIZED | archive | NULL | range | date,ip,date-ip,date-service,ip-date,ip-service | date-ip | 5 | NULL | 44168 | 100.00 | Using where; Using index |
---- -------------- ----------------- ------------ ------- --------------------------------------------------------------- --------- --------- ------------------------ ------- ---------- ---------------------------------
common subquery: 0.0351s
whole query 1: 1.4270s
whole query 2: 1.5601s
total page load: 3.050s (7 queries total)
Am I just doomed to terrible performance with this table?
Hopefully there's enough information here to get an idea of what's going, but if anyone can help I would certainly appreciate it. I don't mind throwing more hardware at the issue, but when an 8c/16t server with 16gb can't handle 150mb of data I'm not sure what will. Thanks in advance for reading my long winded question.
CodePudding user response:
You have the right indexes (as well as many other indexes) and your query both meets your specs and runs close to optimally. It's unlikely that you can make this much faster: it needs to look all the way back to the beginning of your table.
If you can change your spec so you only have to look back a limited amount of time like a year you'll get a good speedup.
Some possible minor tweaks.
- use the
latin1_bin
collation for yourip
andservice
columns. It uses 8-bit characters and collates them without any case sensitivity. That's plenty for IPv4 dotted-quad addresses (and IPv6 addresses). You'll get rid of a bit of overhead in matching and grouping. Or, even better, - If you know you will have nothing but IPv4 addresses, rework your
ip
column to store their binary representations ( that is, theINET_ATON()
- generated value of each IPv4). You can fit those in theUNSIGNED INT
32-bit integer data type, making the lookup, grouping, and ordering even faster.
It's possible to rework the way you gather these data. For example, you could arrange to gather at most one row per service per day. That will reduce the timeseries resolution of your data, but it will also make your queries much faster. Define your table something like this:
CREATE TABLE archive2 (
ip VARCHAR(15) COLLATE latin1_bin NOT NULL,
service ENUM ('ssh','telnet','ftp',
'pop3','imap','rdp',
'vnc','sql','http','smb',
'smtp','dns','sip','ldap') COLLATE latin1_bin NOT NULL,
`date` DATE NOT NULL,
`count` INT NOT NULL,
hostid bigint UNSIGNED NOT NULL,
PRIMARY KEY (`date`, ip, service)
) ENGINE=InnoDB;
Then, when you insert a row, use this query:
INSERT INTO archive2 (`date`, ip, service, `count`, hostid)
VALUES (CURDATE(), ?ip, ?service, 1, ?hostid)
ON DUPLICATE KEY UPDATE
SET count = count 1;
This will automatically increment your count
column if the row for the ip
, service
, and date
already exists.
Then your second query will look like:
SELECT ip, SUM(`count`) AS total
FROM archive
WHERE ip IN (
SELECT ip FROM archive
WHERE `date` > CURDATE() - INTERVAL 1 DAY
GROUP BY ip
HAVING total > 1
)
ORDER BY total DESC, INET_ATON(ip) ASC LIMIT 10;
The index of the primary key will satisfy this query.
CodePudding user response:
First query
(I'm not convinced that it can be made much faster.)
(currently)
SELECT service, COUNT(service) AS total
FROM `archive`
WHERE ip IN (
SELECT DISTINCT ip
FROM `archive`
WHERE date > DATE_SUB(CURRENT_TIMESTAMP, INTERVAL 24 HOUR)
)
GROUP BY service
HAVING total > 1
ORDER BY total DESC, service ASC
LIMIT 10
Notes:
COUNT(service)
-->COUNT(*)
DISTINCT
is not needed inIN (SELECT DISTINCT ...)
IN ( SELECT ... )
is often slow; rewrite usingEXISTS ( SELECT 1 ... )
orJOIN
(see below)INDEX(date, IP)
-- for subqueryINDEX(service, IP)
-- for your outer queryINDEX(IP, service)
-- for my outer query- Toss redundant indexes; they can get in the way. (See below)
- It will have to gather all the possible results before getting to the
ORDER BY
andLIMIT
. (That is,LIMIT
has very little impact on performance for this query.) CHARACTER SET utf8 COLLATE utf8_unicode_ci
is gross overkill for IP addresses; switch toCHARACTER SET ascii COLLATE ascii_bin
.- If you are running MySQL 8.0 (Or MariaDB 10.2), a
WITH
to calculate the subquery once, together with aUNION
to compute the two outer queries, may provide some extra speed. - MariaDB has a "subquery cache" that might have the effect of skipping the second subquery evaluation.
- By using
DATETIME
instead ofTIMESTAMP
, you will two minor hiccups per year when daylight savings kicks in/out. - I doubt if
hostid
needs to be aBIGINT
(8-bytes).
To switch to a JOIN
, think of fetching the candidate rows first:
SELECT service, COUNT(*) AS total
FROM ( SELECT DISTINCT IP
FROM archive
WHERE `date` > NOW() - INTERVAL 24 HOUR
) AS x
JOIN archive USING(IP)
GROUP BY service
HAVING total > 1
ORDER BY total DESC, service ASC
LIMIT 10
For any further discussion any slow (but working) query, please provide both flavors of EXPLAIN
:
EXPLAIN SELECT ...
EXPLAIN FORMAT=JSON SELECT ...
Drop these indexes:
ADD KEY `service` (`service`),
ADD KEY `date` (`date`),
ADD KEY `ip` (`ip`),
Recommend only
ADD PRIMARY KEY (`id`),
-- as discussed:
ADD KEY `date-ip` (`date`,`ip`),
ADD KEY `ip-service` (`ip`,`service`),
ADD KEY `service-ip` (`service`,`ip`),
-- maybe other queries need these:
ADD KEY `date-service` (`date`,`service`),
ADD KEY `ip-date` (`ip`,`date`),
ADD KEY `service-date` (`service`,`date`),
The general rule here is that you don't need INDEX(a)
when you also have INDEX(a,b)
. In particular, they may be preventing the use of better indexes; see the EXPLAINs
.
Second query
Rewrite it
SELECT ip, COUNT(DISTINCT ip) AS total
FROM `archive`
WHERE date > DATE_SUB(CURRENT_TIMESTAMP, INTERVAL 24 HOUR)
GROUP BY ip
HAVING total > 1
ORDER BY total DESC, INET_ATON(ip) ASC
LIMIT 10
It will use only INDEX(date, ip)
.