New aggregation code performs worse despite using 1000x less data, with Postgres


I have this table representing trades:

instrument     VARCHAR                     NOT NULL,
ts             TIMESTAMP WITHOUT TIME ZONE NOT NULL,
quantity       FLOAT8                      NOT NULL,
price          FLOAT8                      NOT NULL,
direction      INTEGER                     NOT NULL

and the process writing the trades also generates 1-minute candles from them:

id               SERIAL PRIMARY KEY,
instrument       VARCHAR                     NOT NULL,
ts               TIMESTAMP WITHOUT TIME ZONE NOT NULL,
open             FLOAT8                      NOT NULL,
high             FLOAT8                      NOT NULL,
low              FLOAT8                      NOT NULL,
close            FLOAT8                      NOT NULL,
midpoint         FLOAT8                      NOT NULL,
volume           FLOAT8                      NOT NULL,
volume_taker_buy FLOAT8                      NOT NULL,
trade_count      INT                         NOT NULL,
UNIQUE (instrument, ts)

I used to have a query that builds 5-minute candles using the trades as a source:

SELECT
to_timestamp(((extract(epoch FROM ts) * 1000)::bigint / (1000 * 300)) * 300)::timestamp AS ts_trunc,
instrument,
(array_agg(price order by ts))[1] open,
MAX(price) high,
MIN(price) low,
(array_agg(price order by ts))[array_upper((array_agg(price order by ts)), 1)] close,
(SUM(price * price * quantity) / SUM(price * quantity)) midpoint,
SUM(price * quantity) volume,
SUM(CASE WHEN direction = 1 THEN price * quantity else 0 END) volume_taker_buy,
count(*) trade_count
FROM binance.trades
WHERE instrument = 'BTCUSDT' AND ts >= '2022-02-25 14:00:00' AND ts <= '2022-02-25 15:00:00'
GROUP BY ts_trunc, instrument
ORDER BY ts_trunc

but then, since I already have the 1-minute candles, I thought I'd aggregate those instead, as it's much less data:

WITH intervals AS (
  SELECT start, start + interval '5min' AS "end"
  FROM generate_series('2022-02-25 14:00:00'::timestamp, '2022-02-25 15:00:00'::timestamp, interval '5min') AS start
)
SELECT DISTINCT
    intervals.start                                    AS date,
    instrument,
    first_value(open) OVER w                           AS open,
    max(high) OVER w                                   AS high,
    min(low) OVER w                                    AS low,
    last_value(close) OVER w                           AS close,
    sum(midpoint * volume) OVER w / sum(volume) OVER w as midpoint,
    sum(volume) OVER w                                 AS volume,
    sum(volume_taker_buy) OVER w                       AS volume_taker_buy,
    sum(trade_count) OVER w                            AS trade_count
FROM
  intervals
  JOIN binance.candles c ON c.ts >= intervals.start AND c.ts < intervals."end"
WHERE instrument='BTCUSDT'
WINDOW w AS (PARTITION BY intervals.start ORDER BY c.ts ASC ROWS BETWEEN UNBOUNDED PRECEDING AND UNBOUNDED FOLLOWING)
ORDER BY intervals.start;

In my data set, the old method pulls roughly 70k rows while the new method pulls 60 rows to achieve the same result, and yet the old method performs better:

First method:

GroupAggregate  (cost=63373.79..68786.17 rows=72123 width=80) (actual time=202.846..287.668 rows=12 loops=1)
"  Group Key: ((to_timestamp((((((EXTRACT(epoch FROM ts) * '1000'::numeric))::bigint / 300000) * 300))::double precision))::timestamp without time zone), instrument"
  ->  Sort  (cost=63373.79..63554.31 rows=72207 width=44) (actual time=193.086..204.715 rows=135418 loops=1)
        Sort Key: ((to_timestamp((((((EXTRACT(epoch FROM ts) * '1000'::numeric))::bigint / 300000) * 300))::double precision))::timestamp without time zone)
        Sort Method: external merge  Disk: 7160kB
        ->  Index Scan using idx_trades_instrument_ts on trades  (cost=0.58..57546.74 rows=72207 width=44) (actual time=0.090..156.526 rows=135418 loops=1)
              Index Cond: ((ts >= '2022-02-25 14:00:00'::timestamp without time zone) AND (ts <= '2022-02-25 15:00:00'::timestamp without time zone) AND ((instrument)::text = 'BTCUSDT'::text))
Planning Time: 1.949 ms
Execution Time: 288.711 ms

Second method:

Unique  (cost=1245404.60..1321179.60 rows=3031000 width=88) (actual time=335.226..335.238 rows=12 loops=1)
  ->  Sort  (cost=1245404.60..1252982.10 rows=3031000 width=88) (actual time=335.225..335.229 rows=60 loops=1)
"        Sort Key: start.start, (first_value(c.open) OVER (?)), (max(c.high) OVER (?)), (min(c.low) OVER (?)), (last_value(c.close) OVER (?)), ((sum((c.midpoint * c.volume)) OVER (?) / sum(c.volume) OVER (?))), (sum(c.volume) OVER (?)), (sum(c.volume_taker_buy) OVER (?)), (sum(c.trade_count) OVER (?))"
        Sort Method: quicksort  Memory: 33kB
        ->  WindowAgg  (cost=624519.30..753336.80 rows=3031000 width=88) (actual time=335.138..335.185 rows=60 loops=1)
              ->  Sort  (cost=624519.30..632096.80 rows=3031000 width=84) (actual time=335.125..335.128 rows=60 loops=1)
"                    Sort Key: start.start, c.ts"
                    Sort Method: quicksort  Memory: 33kB
                    ->  Nested Loop  (cost=0.43..132451.50 rows=3031000 width=84) (actual time=335.036..335.108 rows=60 loops=1)
                          ->  Function Scan on generate_series start  (cost=0.00..10.00 rows=1000 width=8) (actual time=335.011..335.012 rows=12 loops=1)
                          ->  Index Scan using candles_instrument_ts_key on candles c  (cost=0.43..102.13 rows=3031 width=76) (actual time=0.004..0.007 rows=5 loops=12)
                                Index Cond: (((instrument)::text = 'BTCUSDT'::text) AND (ts >= start.start) AND (ts < (start.start + '00:05:00'::interval)))
Planning Time: 0.688 ms
JIT:
  Functions: 34
"  Options: Inlining true, Optimization true, Expressions true, Deforming true"
"  Timing: Generation 17.815 ms, Inlining 28.159 ms, Optimization 174.225 ms, Emission 132.203 ms, Total 352.402 ms"
Execution Time: 353.300 ms

Am I missing something very obvious here? In the end, I'm looking for the fastest way to build this aggregation and I'm confused by the results.

I can do this:

SELECT to_timestamp(((extract(epoch FROM ts) * 1000)::bigint / (1000 * 300)) * 300)::timestamp AS ts_trunc,
    instrument                                       AS instrument,
    (array_agg(open order by ts))[1]                 AS open,
    max(high)                                        AS high,
    min(low)                                         AS low,
    (array_agg(close order by ts))
    [array_upper((array_agg(close order by ts)), 1)] AS close,
    sum(midpoint * volume) / sum(volume)             AS midpoint,
    sum(volume)                                      AS volume,
    sum(volume_taker_buy)                            AS volume_taker_buy,
    sum(trade_count)                                 AS trade_count
FROM binance.candles
WHERE instrument = 'BTCUSDT' AND ts >= '2022-02-25 14:00:00' AND ts < '2022-02-25 15:00:00'
GROUP BY ts_trunc, instrument
ORDER BY ts_trunc

and it is very fast, but I assumed (perhaps wrongly) that array_agg would be slow and a window with first_value / last_value would perform better.

What is the issue here? Since I'm not very experienced with SQL, I don't yet know how to really interpret the EXPLAIN output.

CodePudding user response:

The main difference is that in the second approach you are using the built-in function generate_series. Before the select from binance.candles is executed, PostgreSQL needs to build that set of rows.

As you can see in the execution plan of the second query, the actual time is spent in the function that generates the candle ranges (look at the actual time= values). This part of the execution spent around 335 ms to generate a series that contains only 12 rows:

Function Scan on generate_series start  (cost=0.00..10.00 rows=1000 width=8) (actual time=335.011..335.012 rows=12 loops=1)

One of the reasons is that generate_series is a generic function, so when it is called for certain data types (in this case timestamp) it is not as efficient as a function specialised for that particular data type. You can find lots of discussions on the internet about performance issues with this particular function.

The third query avoids the function, so its execution is fast. Whatever gain you might have expected from using a window instead of array_agg was lost to generating the series.

In addition, you have a Cartesian join between the data coming from the table and the data from the function:

Nested Loop  (cost=0.43..132451.50 rows=3031000 width=84) (actual time=335.036..335.108 rows=60 loops=1)

so when the data starts growing, if the plan remains the same, I expect the performance to start degrading.
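
If you want to avoid the function entirely while still aggregating the 1-minute candles, your third query already does that. On PostgreSQL 14 or newer (an assumption about your version), the 5-minute bucketing can also be written with date_bin instead of the epoch arithmetic; a sketch along those lines:

-- Sketch only: assumes PostgreSQL 14+ for date_bin and the binance.candles
-- schema shown above. The origin timestamp just anchors the 5-minute grid.
SELECT date_bin('5 min', ts, TIMESTAMP '2000-01-01')  AS ts_trunc,
       instrument,
       (array_agg(open ORDER BY ts))[1]               AS open,
       max(high)                                      AS high,
       min(low)                                       AS low,
       (array_agg(close ORDER BY ts DESC))[1]         AS close,  -- last close in the bucket
       sum(midpoint * volume) / sum(volume)           AS midpoint,
       sum(volume)                                    AS volume,
       sum(volume_taker_buy)                          AS volume_taker_buy,
       sum(trade_count)                               AS trade_count
FROM binance.candles
WHERE instrument = 'BTCUSDT'
  AND ts >= '2022-02-25 14:00:00'
  AND ts <  '2022-02-25 15:00:00'
GROUP BY ts_trunc, instrument
ORDER BY ts_trunc;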

CodePudding user response:

Essentially all your time (99.7%) is spent setting up the JIT (just-in-time compilation). Turn off jit to avoid that massive overhead.

Why does the 2nd one do jit and not the first one? That largely comes down to the fact that generate_series thinks it will return 1000 rows but really only returns 12. That makes the whole query look way more expensive than it really is, which in turn makes the planner think the high overhead of jit will pay for itself. There are other estimation errors as well, but that is the most obvious one.

generate_series with integer arguments knows how many rows it will return (in recent enough versions), but generate_series with timestamp and interval arguments does not, and just guesses it will return 1000.
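
If you want to keep the generate_series approach, one workaround (a sketch, not the only option; five_min_starts is a made-up name) is to wrap the series in a SQL function with an explicit ROWS estimate, so the planner stops assuming 1000 rows:

-- Hypothetical wrapper around generate_series; ROWS tells the planner
-- roughly how many rows to expect (your plan shows 12 for a one-hour window).
CREATE FUNCTION five_min_starts(from_ts timestamp, to_ts timestamp)
    RETURNS SETOF timestamp
    LANGUAGE sql STABLE
    ROWS 12
AS $$
    SELECT generate_series(from_ts, to_ts, interval '5 min');
$$;

-- Then in the intervals CTE:
--   FROM five_min_starts('2022-02-25 14:00:00', '2022-02-25 15:00:00') AS start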

jit is a niche feature which is seldom helpful but randomly imposes high costs on the unsuspecting. I think it was a mistake to turn it on by default. You could play around with jit_above_cost, but it is easier to just turn jit itself off.
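
For example, to compare with and without it (these are the standard jit settings):

-- Per session, to test the difference:
SET jit = off;

-- Or keep JIT but raise the threshold so only genuinely expensive plans
-- trigger it (the default jit_above_cost is 100000):
SET jit_above_cost = 1000000;

-- To disable it for the whole cluster:
ALTER SYSTEM SET jit = off;
SELECT pg_reload_conf();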
