Home > Net >  GROUP BY - How can I optimize this MySQL query?
GROUP BY - How can I optimize this MySQL query?

Time:06-28

SELECT results.idpatient AS nhs_number, TIMESTAMPDIFF(YEAR, STR_TO_DATE(results.dob,'%Y-%m-%d'), CURDATE()) AS age, most_overdue.days_overdue,
most_overdue.current_status, most_overdue.action, indications.associated_indications
FROM mytable AS results
INNER JOIN (
    SELECT
        idpatient,
        MAX(days_overdue) as days_overdue,
        SUBSTRING(MAX(CONCAT(LPAD(days_overdue, 11, '0'), current_status)), 12) AS current_status,
        SUBSTRING(MAX(CONCAT(LPAD(days_overdue, 11, '0'), action)), 12) AS action
    FROM
        mytable
    GROUP BY
        idpatient
) AS most_overdue
ON results.idpatient = most_overdue.idpatient
INNER JOIN (
    SELECT idpatient, action, GROUP_CONCAT(indication SEPARATOR ',') as associated_indications
    FROM mytable
    GROUP BY idpatient, action
) AS indications
ON results.idpatient = indications.idpatient AND most_overdue.action = indications.action
GROUP BY results.idpatient
LIMIT 0,100;

The above query takes ~4s to run on my big MySQL DB, and the issue seems to be the GROUP BYs.

See a simplified Fiddle here .

I read MySQL's page on optimizing GROUP BY, but this doesn't seem to cover its use in conjunction with GROUP_CONCAT.

The schema is as follows:

CREATE TABLE `mytable` (
  `idpatient` varchar(32) NOT NULL,
  `indication` varchar(255) NOT NULL,
  `action` varchar(255) NOT NULL,
  `current_status` varchar(255) NOT NULL,
  `query_type` varchar(255) NOT NULL,
  `last_date` varchar(255) NOT NULL,
  `days_overdue` bigint(20) DEFAULT NULL,
  `dob` varchar(255) NOT NULL,
  PRIMARY KEY (`idpatient`,`indication`,`action`),
  KEY `action_idx` (`action`),
  KEY `indication_idx` (`indication`),
  KEY `idpatient_action_idx` (`idpatient`,`action`)
) ENGINE=InnoDB DEFAULT CHARSET=latin1;

EXPLAIN tells me the indication_idx and idpatient_action_idx are in use when running the slow query.

Any help towards getting this under 1s would be much appreciated, thanks!

CodePudding user response:

I guess, without any evidence to support my guess, that you want your resultset to show the hundred lowest idpatient values.

Your query is made of two subqueries and a main query. Let's start with the first subquery.

SELECT 
  idpatient,
  MAX(days_overdue) as days_overdue,
  SUBSTRING(MAX(CONCAT(LPAD(days_overdue, 11, '0'), current_status)), 12) AS current_status,
  SUBSTRING(MAX(CONCAT(LPAD(days_overdue, 11, '0'), action)), 12) AS action
FROM    mytable
GROUP BY idpatient
ORDER BY idpatient
LIMIT    0, 100;

Notice that I included your LIMIT clause in this subquery.

This subquery will benefit from the following covering index.

ALTER TABLE mytable 
  ADD INDEX patient_days_status_activity
            (idpatient, days_overdue, current_status, action);

The first two columns of the index support your GROUP BY operation. If your table is smaller than something like 100k rows you can omit the last two columns from that index; they only serve to make the lookup of current_status and action a bit faster.

With the index in place examine the performance of that subquery; look at EXPLAIN output. Convince yourself it's fast enough for your purposes.

Now for the second subquery. Your idpatient_action_idx index helps with this query. You should add the ORDER BY / LIMIT clauses to it. You could also add indication to that index to make it into a covering index.

ALTER TABLE mytable
  DROP INDEX idpatient_action_idx,
  ADD INDEX idpatient_action_idx (idpatient, action, indication);

Make the index change if you want to. Then examine the performance of that subquery as well. If both the subqueries have acceptable performance your main query probably will too.

CodePudding user response:

Does this query give the correct result (I assume it is still slow)?

SELECT results.idpatient AS nhs_number
     , most_overdue.days_overdue
     , most_overdue.current_status
     , most_overdue.action
     , indications.associated_indications
FROM (
    SELECT distinct idpatient
    FROM mytable
) AS results
INNER JOIN (
    SELECT
        idpatient,
        MAX(days_overdue) as days_overdue,
        SUBSTRING(MAX(CONCAT(LPAD(days_overdue, 11, '0'), current_status)), 12) AS current_status,
        SUBSTRING(MAX(CONCAT(LPAD(days_overdue, 11, '0'), action)), 12) AS action
    FROM
        mytable
    GROUP BY
        idpatient
) AS most_overdue
ON results.idpatient = most_overdue.idpatient
INNER JOIN (
    SELECT idpatient, action, GROUP_CONCAT(indication SEPARATOR ',') as associated_indications
    FROM mytable
    GROUP BY idpatient, action
) AS indications
ON results.idpatient = indications.idpatient AND most_overdue.action = indications.action;

This may be a cleaner way to derive the current_status/action associated with most days_overdue:

SELECT results.idpatient AS nhs_number
     , (most_overdue.days_overdue)
     , (results.current_status)
     , results.action
     , indications.associated_indications
FROM mytable AS results
JOIN (
    SELECT 
        idpatient,
        MAX(days_overdue) as days_overdue
    FROM mytable
    GROUP BY idpatient
) AS most_overdue
    ON results.idpatient = most_overdue.idpatient
    AND results.days_overdue = most_overdue.days_overdue
JOIN (
    SELECT idpatient, action, GROUP_CONCAT(indication SEPARATOR ',') as associated_indications
    FROM mytable
    GROUP BY idpatient, action
) AS indications
     ON results.idpatient = indications.idpatient 
     AND results.action = indications.action
-- GROUP BY results.idpatient, results.action     
ORDER BY results.idpatient;

I can't help thinking that it should be possible to remove at least 1 access against mytable, but right now I don't see how

  • Related