Poor performance of LEFT JOIN despite query on indexed field

I'm trying to get summary data from a table (dd = devicesData) that receives a row every 10 minutes. I want the data for one day (24 hours), i.e. 144 rows summarized into one.

devicesData is indexed on createdAt and id:

Table        Non_unique  Key_name     Seq_in_index  Column_name  Collation  Cardinality  Sub_part  Packed  Null  Index_type  Comment  Index_comment  Visible  Expression
devicesData  0           PRIMARY      1             id           A          165482       NULL      NULL          BTREE                                YES      NULL
devicesData  1           devesData_1  1             devId        A          21           NULL      NULL          BTREE                                YES      NULL
devicesData  1           fk_devicesD  1             roomId       A          16           NULL      NULL          BTREE                                YES      NULL
devicesData  1           createdAt_   1             createdAt    A          156304       NULL      NULL    YES   BTREE                                YES      NULL
devicesData  1           dt_index     1             dt           A          164176       NULL      NULL          BTREE                                YES      NULL
devicesData  1           idx_roId_dt  1             roomId       A          17           NULL      NULL          BTREE                                YES      NULL
devicesData  1           idx_roId_dt  2             dt           A          164876       NULL      NULL          BTREE                                YES      NULL

SELECT
    r.id AS `roomId`, 
    l.id AS `levelId`, 
    b.id AS `buildingId`, 
    s.id AS `siteId`, 
    date(CURDATE() - INTERVAL 1 DAY) AS `date`,
    r.`name` AS `roomName`, 
    l.`name` AS `levelName`, 
    b.`name` AS `buildingName`, 
    s.`name` AS `siteName`,
    r.`size` * r.`eui` * cdd.`value` / 3280 AS `referenceConsumption`,
    SUM(dd.activeEnergy) AS `energy`, 
    SUM(dd.coolingEnergy) AS `coolingEnergy`, 
    SUM(dd.fanTime) AS `fanTime`, 
    SUM(dd.comprTime) AS `comprTime`, 
    SUM(dd.presence) AS `presenceTime`,
    AVG(dd.temp1) AS `temp1`,
    COUNT(dd.id) * 10 AS `conTime`
FROM rooms r   
JOIN levels AS l ON l.id = r.levelId
JOIN buildings AS b ON b.id = l.buildingId
JOIN sites AS s ON s.id = b.siteId
LEFT JOIN Devices.coolingDegreeDays AS cdd
ON cdd.`day` = date(CURDATE() - INTERVAL 1 DAY)
LEFT JOIN devicesData dd
ON dd.roomId = r.id AND date(dd.dt) = date(CURDATE() - INTERVAL 1 DAY)
GROUP BY r.id;

For a minimal data set of 32 rooms, i.e. a maximum of 32 x 144 entries, the response time is 0.5 seconds. On a larger data set of 183 rooms x 144 rows = 26352 rows, the response time is 4 minutes. That is acceptable but still seems ridiculously slow for this small data set.

Note that a number of rows in devicesData are not populated (no data for that timestamp), so there is no data available for that id and time.

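EXPLAIN devicesData result (the table definition, equivalent to DESCRIBE; names and types are truncated by the client):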

Field     Type     Null  Key  Default            Extra
id        int      NO    PRI  NULL               auto_increment
devId     varchar  NO    MUL  NULL
roomId    int      NO    MUL  NULL
dt        datetim  NO    MUL  CURRENT_TIMESTAMP  DEFAULT_GENERATED
temp1     float    YES        NULL
temp2     float    YES        NULL
hum1      float    YES        NULL
relay1    tinyint  YES        NULL
relay2    tinyint  YES        NULL
presence  float    YES        NULL
fanTime   int      YES        NULL
comprTim  int      YES        NULL
activeEn  float    YES        NULL
reactEne  float    YES        NULL
airflowA  float    YES        NULL
coolingE  float    YES        NULL
createdA  int uns  YES   MUL  NULL

I suspect that the culprit is this line:

LEFT JOIN devicesData dd
ON dd.roomId = r.id AND date(dd.dt) = date(CURDATE() - INTERVAL 1 DAY)

since it needs to evaluate date(dd.dt) for every line and will not benefit from the index.

Experimenting by replacing it with a range condition on the indexed field dd.createdAt (a Unix timestamp):

LEFT JOIN devicesData dd
ON dd.roomId = r.id
AND dd.createdAt > unix_timestamp(CURDATE() - INTERVAL 1 DAY)
AND dd.createdAt < unix_timestamp(CURDATE())

the performance is worse: a query that takes 0.5 seconds for 32 rooms now takes 2.5 seconds.

Is there an issue with the index or what is the reason for this relatively poor performance?

CodePudding user response：

Please include the EXPLAIN output for your query. It is possible that the stats have not updated and the index is not being used as you expect. ANALYZE TABLE devicesData; will update the stats.
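As a rough sketch of that check (assuming MySQL 8.0, which the Visible and Expression columns in the index listing suggest), the statistics can be refreshed and the plan inspected for a trimmed-down version of the join from the question:

```sql
-- Refresh the index statistics used by the optimizer.
ANALYZE TABLE devicesData;

-- Plan for the join as written in the question, reduced to two tables.
EXPLAIN
SELECT r.id, COUNT(dd.id)
FROM rooms AS r
LEFT JOIN devicesData AS dd
       ON dd.roomId = r.id
      AND DATE(dd.dt) = DATE(CURDATE() - INTERVAL 1 DAY)
GROUP BY r.id;
```

In the output, the key and rows columns for dd show which index was chosen and roughly how many rows are examined per room.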

As you have identified, the likely culprit is -

LEFT JOIN devicesData dd ON dd.roomId = r.id and date(dd.dt) = date(CURDATE()- INTERVAL 1 DAY)

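Wrapping dd.dt in DATE() means the engine cannot use an index on the dt column for this condition; the DATE() around the constant expression on the right-hand side is harmless. The index listing in the question shows indexes that include dt, but they are of no use while the column is hidden inside a function call. Rewriting the join as a range on the bare column should improve things -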

LEFT JOIN devicesData dd
    ON dd.roomId = r.id
    AND dd.dt BETWEEN (CURDATE()- INTERVAL 1 DAY) AND (CURDATE()- INTERVAL 1 SECOND)

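A composite index on (roomId, dt) lets this condition be resolved from a single index. The listing in the question appears to show one already (idx_roId_dt); if it is missing, add it -

ALTER TABLE `devicesData` ADD INDEX `idx_roomId_dt` (`roomId`, `dt`);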
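Before adding an index, it may be worth listing what already exists, since a duplicate would only add write overhead. The sketch below assumes the table lives in the currently selected schema; the room id 1 in the second statement is an arbitrary value chosen for illustration.

```sql
-- List every index on devicesData with its columns in key order.
SELECT INDEX_NAME, SEQ_IN_INDEX, COLUMN_NAME
FROM information_schema.STATISTICS
WHERE TABLE_SCHEMA = DATABASE()
  AND TABLE_NAME = 'devicesData'
ORDER BY INDEX_NAME, SEQ_IN_INDEX;

-- Check that the range form can use an index on (roomId, dt).
EXPLAIN
SELECT COUNT(*)
FROM devicesData AS dd
WHERE dd.roomId = 1                       -- arbitrary room id (assumption)
  AND dd.dt >= CURDATE() - INTERVAL 1 DAY
  AND dd.dt <  CURDATE();
```

If the composite index is used, the key column of the EXPLAIN output should name it, and type will typically be range.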

CodePudding user response：

(nnichols gives a good suggestion for making dd.dt sargeable)

Beware of SUM() with JOIN. All the JOINing (including NULLs for empty LEFT JOINs) is done first and written [logically] to a big temp table. Then the SUMs are done. Check your output for having counts and sums that are too big.
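A small self-contained illustration of that inflation, using throwaway temporary tables whose names (demo_rooms, demo_data, demo_tags) are invented for this example and are not part of the schema in the question:

```sql
CREATE TEMPORARY TABLE demo_rooms (id INT PRIMARY KEY);
CREATE TEMPORARY TABLE demo_data  (roomId INT, energy INT);
CREATE TEMPORARY TABLE demo_tags  (roomId INT, tag VARCHAR(10));

INSERT INTO demo_rooms VALUES (1);
INSERT INTO demo_data  VALUES (1, 10), (1, 20);    -- true total: 30
INSERT INTO demo_tags  VALUES (1, 'a'), (1, 'b');  -- second one-to-many table

-- Each data row is paired with each tag row (2 x 2 = 4 joined rows),
-- so the SUM counts every energy value twice and returns 60.
SELECT r.id, SUM(d.energy) AS energy
FROM demo_rooms AS r
JOIN demo_data AS d ON d.roomId = r.id
JOIN demo_tags AS t ON t.roomId = r.id
GROUP BY r.id;

-- Aggregating before any other one-to-many join returns the correct 30.
SELECT roomId, SUM(energy) AS energy
FROM demo_data
GROUP BY roomId;
```

In the query from the question, the same effect would appear if coolingDegreeDays ever held more than one row for the same day.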

Before I rewrite the entire query, please check to see that this produces the correct SUMs:

SELECT r.id, SUM(dd...) as energy, ...
    FROM rooms r
    JOIN devicesData dd  ON dd.roomId = r.id
    WHERE dd.dt >= CURDATE() - INTERVAL 1 DAY
      AND dd.dt  < CURDATE()
    GROUP BY r.id;

If that is a 1:1 mapping, then you don't need the SUMs?

If that looks OK, then

SELECT ...
    FROM rooms r
    LEFT JOIN ( the above query ) AS sums  ON sums.id = r.id
    JOIN  ... (the rest of the stuff)
    -- no GROUP BY
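Spelled out against the query from the question, the outline above might look like the following. This is a sketch, not a tested rewrite: it assumes coolingDegreeDays has at most one row per day (otherwise the outer rows would multiply), and it aggregates devicesData directly by roomId instead of joining rooms inside the subquery, which is a simplification made here.

```sql
SELECT
    r.id AS roomId, l.id AS levelId, b.id AS buildingId, s.id AS siteId,
    CURDATE() - INTERVAL 1 DAY AS `date`,
    r.`name` AS roomName, l.`name` AS levelName,
    b.`name` AS buildingName, s.`name` AS siteName,
    r.`size` * r.`eui` * cdd.`value` / 3280 AS referenceConsumption,
    sums.energy, sums.coolingEnergy, sums.fanTime, sums.comprTime,
    sums.presenceTime, sums.temp1,
    COALESCE(sums.conTime, 0) AS conTime   -- 0 for rooms with no rows, as before
FROM rooms AS r
JOIN levels AS l ON l.id = r.levelId
JOIN buildings AS b ON b.id = l.buildingId
JOIN sites AS s ON s.id = b.siteId
LEFT JOIN Devices.coolingDegreeDays AS cdd
       ON cdd.`day` = CURDATE() - INTERVAL 1 DAY
LEFT JOIN (
    -- One row per room for yesterday; the half-open range on the bare
    -- dt column can use an index on (roomId, dt).
    SELECT dd.roomId,
           SUM(dd.activeEnergy)  AS energy,
           SUM(dd.coolingEnergy) AS coolingEnergy,
           SUM(dd.fanTime)       AS fanTime,
           SUM(dd.comprTime)     AS comprTime,
           SUM(dd.presence)      AS presenceTime,
           AVG(dd.temp1)         AS temp1,
           COUNT(*) * 10         AS conTime
    FROM devicesData AS dd
    WHERE dd.dt >= CURDATE() - INTERVAL 1 DAY
      AND dd.dt <  CURDATE()
    GROUP BY dd.roomId
) AS sums ON sums.roomId = r.id;
```

Because the derived table already yields one row per room, the outer query needs no GROUP BY, and the aggregates are computed before any other join can multiply the rows.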

In any case, these indexes may be beneficial:

r:  INDEX(levelId, id,  name, size, eui)
cdd (Devices.coolingDegreeDays):  INDEX(day,  value)
dd:  INDEX(roomId, dt)