how to speed up a query containing HAVING


I have a table with close to a billion records, and need to do a query on it with HAVING. Problem is, it's very slow (about 15 minutes on decent hardware). Is there a way to speed it up?

SELECT ((mean - 3.0E-4)/(stddev/sqrt(N))) as t, ttest.strategyid, mean, stddev, N, 
  kurtosis, strategies.strategyId 
FROM ttest,strategies 
WHERE ttest.strategyid=strategies.id AND dataset=3 AND patternclassid="1" 
  AND exitclassid="1" AND N>= 300 HAVING t>=1.8

I think the problem is that 't' cannot be indexed because it needs to be computed. I cannot add it as a column because the '3.0E-4' will vary per query. Here's my table:

create table ttest (
  strategyid bigint,
  patternclassid integer not null,
  exitclassid integer not null,
  dataset integer not null,
  N integer,
  mean double,
  stddev double,
  skewness double,
  kurtosis double,

  primary key (strategyid, dataset)
);
create index ti3 on ttest (mean);
create index ti4 on ttest (dataset,patternclassid,exitclassid,N);

create table strategies (
  id bigint ,
  strategyId varchar(500),

  primary key(id),
  unique key(strategyId)
);

I saw in other posts that it's good to include the output of EXPLAIN SELECT .... Here it is, but I'm not sure what to make of it:

+----+-------------+------------+------------+--------+---------------+---------+---------+---------------------------------+---------+----------+----------------------------------+
| id | select_type | table      | partitions | type   | possible_keys | key     | key_len | ref                             | rows    | filtered | Extra                            |
+----+-------------+------------+------------+--------+---------------+---------+---------+---------------------------------+---------+----------+----------------------------------+
|  1 | SIMPLE      | ttest      | NULL       | range  | PRIMARY,ti4   | ti4     | 17      | NULL                            | 1910344 |   100.00 | Using index condition; Using MRR |
|  1 | SIMPLE      | strategies | NULL       | eq_ref | PRIMARY       | PRIMARY | 8       | Jellyfish_test.ttest.strategyid |       1 |   100.00 | Using where                      |
+----+-------------+------------+------------+--------+---------------+---------+---------+---------------------------------+---------+----------+----------------------------------+

CodePudding user response:

The query needs to be reformulated and an index needs to be added.

Plan A:

SELECT  ((tt.mean - 3.0E-4)/(tt.stddev/sqrt(tt.N))) as t,
        tt.strategyid, tt.mean, tt.stddev, tt.N, tt.kurtosis,
        s.strategyId
    FROM  ttest AS tt
    JOIN  strategies AS s  ON tt.strategyid = s.id
    WHERE  tt.dataset = 3
      AND  tt.patternclassid = 1
      AND  tt.exitclassid = 1
      AND  tt.N >= 300
      AND  ((tt.mean - 3.0E-4)/(tt.stddev/sqrt(tt.N))) >= 1.8

and a 'composite' and 'covering' index on ttest. Replace your ti4 with this (to make it 'covering'):

INDEX(dataset, patternclassid, exitclassid,  -- any order
      N, strategyid)     -- in this order
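As a concrete sketch (reusing the index name ti4 is my assumption; any name works), the swap could be done in one ALTER so the table is only rebuilt once:

```sql
-- Replace the old index with the composite, covering one.
-- The first three equality-tested columns can be in any order;
-- N (a range test) and strategyid (needed for the join) come last.
ALTER TABLE ttest
  DROP INDEX ti4,
  ADD INDEX ti4 (dataset, patternclassid, exitclassid, N, strategyid);
```

This lets MySQL evaluate the dataset/patternclassid/exitclassid/N filters and pick up strategyid for the join entirely from the index, fetching the full row (for mean, stddev, kurtosis) only for candidates that survive.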

Plan B:

SELECT  ((tt.mean - 3.0E-4)/(tt.stddev/sqrt(tt.N))) as t,
        tt.strategyid, tt.mean, tt.stddev, tt.N, tt.kurtosis,
        ( SELECT s.strategyId 
              FROM strategies AS s
              WHERE s.id = tt.strategyid
        ) AS strategyId
    FROM  ttest AS tt
    WHERE  tt.dataset = 3
      AND  tt.patternclassid = 1
      AND  tt.exitclassid = 1
      AND  tt.N >= 300
      AND  ((tt.mean - 3.0E-4)/(tt.stddev/sqrt(tt.N))) >= 1.8

With the same index.

Unfortunately the expression for t needs to be repeated. Moving it from HAVING to WHERE avoids gathering unwanted rows only to throw them away afterward. The optimizer may do that automatically; run EXPLAIN SELECT ... to see.

Also, it is unclear whether one of the two formulations will run faster than the other.

CodePudding user response:

To be honest, I've never seen HAVING used like this; for 20 years I've assumed it could only be used with GROUP BY!

Anyway, IMHO you don't need it here; as Rick James points out, you can put it all in the WHERE clause. Rewriting it a bit, I end up with:

SELECT ((t.mean - 3.0E-4)/(t.stddev/sqrt(t.N))) as t, 
       t.strategyid, 
       t.mean, 
       t.stddev, 
       t.N, 
       t.kurtosis, 
       s.strategyId 
 FROM ttest t
 JOIN strategies s
   ON s.id = t.strategyid
WHERE t.dataset=3 
  AND t.patternclassid="1" 
  AND t.exitclassid="1" 
  AND t.N>= 300 
  AND ((t.mean - 3.0E-4)/(t.stddev/sqrt(t.N))) >= 1.8
  

For most of that we can indeed foresee a reasonable index. The problem remains with the last condition:

  AND ((t.mean - 3.0E-4)/(t.stddev/sqrt(t.N))) >= 1.8

However, before we get to that: how many rows are there if you ignore this 'formula'? 100? 200? If so, indexing as foreseen in Rick James' answer should be sufficient IMHO. If it's in the thousands or many more, then the question becomes: how many of those are thrown out by the formula? 1%? 50%? 99%? If it's on the low side then, again, indexing as proposed by Rick James will do. If, however, you only need to keep a few, you may want to optimize further and index accordingly. From your explanation I understand that 3.0E-4 varies per query, so we can't include it in the index; we'll need to extract the parts we can:

If my algebra isn't failing me, you can rearrange the formula like this (multiplying both sides by t.stddev / sqrt(t.N) is safe because a standard deviation is positive):

 AND ((t.mean - 3.0E-4) / (t.stddev / sqrt(t.N))) >= 1.8
 AND   t.mean - 3.0E-4  >=  1.8 * (t.stddev / sqrt(t.N))
 AND          - 3.0E-4  >=  1.8 * (t.stddev / sqrt(t.N)) - t.mean
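As a quick sanity check of the rearrangement, with made-up numbers (mean = 0.002, stddev = 0.01, N = 400 are arbitrary, not from the data):

```sql
-- Original form:   (0.002 - 3.0E-4) / (0.01 / SQRT(400)) = 0.0017 / 0.0005 = 3.4,  which is >= 1.8
-- Rearranged form:  1.8 * (0.01 / SQRT(400)) - 0.002     = 0.0009 - 0.002  = -0.0011, which is <= -3.0E-4
SELECT ((0.002 - 3.0E-4) / (0.01 / SQRT(400))) AS t_original,
       (1.8 * (0.01 / SQRT(400))) - 0.002      AS rearranged;
```

Both forms accept (or reject) the same rows.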
 

So the query becomes:

SELECT ((t.mean - 3.0E-4)/(t.stddev/sqrt(t.N))) as t, 
       t.strategyid, 
       t.mean, 
       t.stddev, 
       t.N, 
       t.kurtosis, 
       s.strategyId 
 FROM ttest t
 JOIN strategies s
   ON s.id = t.strategyid
WHERE t.dataset=3 
  AND t.patternclassid="1" 
  AND t.exitclassid="1" 
  AND t.N>= 300 
  AND (1.8 * (t.stddev / sqrt(t.N))) - t.mean <= -3.0E-4    

I'm not familiar with MySQL, but glancing at the documentation, it should be possible to include 'generated columns' in an index. So we'll do exactly that with (1.8 * (t.stddev / sqrt(t.N))) - t.mean.

Your indexed fields thus become:

dataset, patternclassid, exitclassid, N, (1.8 * (t.stddev / sqrt(t.N))) - t.mean

Note that the system will have to calculate this value for each and every row on insert (and possibly update) you do on the table. However, once there (and indexed) it should make the query quite a bit faster.
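In MySQL that could look like the following sketch. The column name tlimit and index name ti5 are made up here; STORED generated columns need MySQL 5.7+, and note that the hard-coded 1.8 means the index only helps queries using that exact threshold:

```sql
-- Materialize the variable-free part of the formula as a stored
-- generated column, then index it after the equality/range columns.
ALTER TABLE ttest
  ADD COLUMN tlimit DOUBLE
      AS (1.8 * (stddev / SQRT(N)) - mean) STORED,
  ADD INDEX ti5 (dataset, patternclassid, exitclassid, N, tlimit);
```

The query's last condition then becomes `AND t.tlimit <= -3.0E-4`, with the per-query constant on the right-hand side where it belongs.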
