slow performance on MYSQL recursive cte query-CodePudding

I want to calculate the 9 min Exponential Moving Average (EMA) for a large set of minute stock-data exceeding 30,000,000 rows with roughly 4,500 different Tickers. The query I have makes use of recursive cte due to the nature of the EMA, which is always based on the previous row (minute).

The code works but here is the problem. In order to calculate the EMA for only two stocks which in sum make up 14,000 rows of the minute_data table the query took 19min 40s. Assuming it is just as fast per row for the whole data-set it would take the MYSQL-server between 60h and 70h to execute.

The minute_data table is built as follows:

create table min_data
(
    t            datetime       not null,
    ticker       varchar(10)    not null,
    o            decimal(10, 4) not null,
    h            decimal(10, 4) not null,
    l            decimal(10, 4) not null,
    c            decimal(10, 4) not null,
    primary key (t, ticker)
);

I will only use the following columns:

t = date & time of each trading minute
ticker = Stock Symbol (e.g. Tesla -> TSLA)
c = close price of each trading minute

EMA calculation:
EMA = Closing price (current minute) * alpha EMA (previous minute) * (1-alpha)

SET GLOBAL cte_max_recursion_depth=1000000;

SET @alpha = 2 / (1   9);

CREATE TABLE min_data_EMA9 AS
WITH RECURSIVE t AS (
    SELECT t, ticker,
           row_number() over (partition by ticker order by t) as QuoteId,
           c
    FROM min_data
),

ema (t, ticker, QuoteId, c, EMA9) AS (
    SELECT *, avg(c) as EMA9
    FROM t
    WHERE QuoteId between 1 and 8
    GROUP BY ticker

    UNION ALL

    SELECT t2.t,
           t2.ticker,
           t2.QuoteId,
           t2.c,
           @alpha * t2.c   (1 - @alpha) * EMA9 as EMA9
    FROM ema
    JOIN t t2
        ON ema.QuoteId = t2.QuoteId - 1
        AND  ema.ticker = t2.ticker
)

SELECT t, ticker, QuoteId, EMA9
FROM ema;

When limiting the first select statement with: WHERE ticker = 'TOPS' to one single Stock the EXPLAIN ANALYZE function for the WITHstatement returns the following (Executiontime = 18min 39s):

-> Table scan on ema  (cost=0.01..37822.72 rows=3025619) (actual time=0.002..4.562 rows=68471 loops=1)
    -> Materialize recursive CTE ema  (cost=1395569.12..1433391.84 rows=3025619) (actual time=1097987.024..1097996.206 rows=68471 loops=1)
        -> Table scan on <temporary>  (actual time=0.001..0.001 rows=1 loops=1)
            -> Aggregate using temporary table  (actual time=30081.632..30081.633 rows=1 loops=1)
                -> Filter: (t.QuoteId between 1 and 8)  (cost=1.01..306376.90 rows=302562) (actual time=30069.843..30081.576 rows=8 loops=1)
                    -> Table scan on t  (cost=2.50..2.50 rows=0) (actual time=0.001..3.723 rows=68471 loops=1)
                        -> Materialize CTE t if needed  (cost=2.50..2.50 rows=0) (actual time=30069.836..30078.154 rows=68471 loops=1)
                            -> Window aggregate: row_number() OVER (PARTITION BY min_data.ticker ORDER BY min_data.t )   (actual time=29997.654..30020.406 rows=68471 loops=1)
                                -> Sort: min_data.ticker, min_data.t  (cost=2861564.76 rows=27233289) (actual time=29997.639..30002.597 rows=68471 loops=1)
                                    -> Filter: (min_data.ticker = 'TOPS')  (cost=2861564.76 rows=27233289) (actual time=0.512..29953.891 rows=68471 loops=1)
                                        -> Table scan on min_data  (cost=2861564.76 rows=27233289) (actual time=0.510..27585.660 rows=30323912 loops=1)
        -> Repeat until convergence
            -> Nested loop inner join  (cost=1093007.22 rows=3025619) (actual time=0.010..533731.050 rows=34235 loops=2)
                -> Filter: (ema.ticker is not null)  (cost=34040.61 rows=302561) (actual time=0.004..50.239 rows=34236 loops=2)
                    -> Scan new records on ema  (cost=34040.61 rows=302561) (actual time=0.003..26.824 rows=34236 loops=2)
                -> Filter: (ema.QuoteId = (t2.QuoteId - 1))  (cost=0.25..2.50 rows=10) (actual time=7.784..15.587 rows=1 loops=68471)
                    -> Index lookup on t2 using <auto_key0> (ticker=ema.ticker)  (actual time=0.004..7.431 rows=68471 loops=68471)
                        -> Materialize CTE t if needed (query plan printed elsewhere)  (cost=0.00..0.00 rows=0) (never executed)

I'm new to recursive cte and to a certain extend also to query optimization. Therefore I will be greatfull for your suggestions on how to make this query a lot faster!

CodePudding user response：

Since you are focused on one ticker at a time, this would probably be much better:

PRIMARY KEY(ticker, t)

Are you calculating the EMA from the first and reading to the last? And not saving any intermediate results? What is the factor you are using? Depending on the factor, you really don't need to compute with more than the last hundred or so values.

Do you store the closing price for every minute, even if nothing changes?

Assuming the market never closes for more than a long weekend (that is, no 9/11 hiccups), this should be sufficient (and overkill):

WHERE ticker = ?
  AND t > NOW() - INTERVAL 4 DAYS
ORDER BY t

I strongly suggest that you SELECT the entries and do the EMA in your application. It will be a a lot faster than OVER / CTE / stored function / etc. inside MySQL.

Note that my change to the PRIMARY KEY will make the rows fetched by the WHERE and ORDER BY will be consecutive and already ordered. And, unlike OVER, only one pass will need to be made. In your app, you can simply make one pass over the array received from the SELECT.

Today you have "only 2 stocks"; tomorrow you will have 3. Do one Select for each stock and compute its EMA. That is, loop through the 2 stocks inside your app.

Another thought... Store the EMA at the end of each hour (including the end of the day). Then, when you want to recompute it, pick up from there; there will be no precision lost, but a lot less computation to do.

Store those hourly EMAs in a separate table. The value primes the EMA algorithm very simply. Now you are looking at something like

WHERE ticker = ?
  AND t > CONCAT(LEFT(NOW(), 13), ":00:00")
ORDER BY t

That CONCAT computes the start of the current hour (as a string) and then uses that to compare with t (a DATETIME). Again, my PK and the "do one ticker at a time" suggestions apply.