Postgres Query Optimization for Percentiles of Sales by Region-CodePudding

I'm computing some percentile sales data grouped by product / US state, and the query is extremely slow for a large list of locations (see "dma" in [long list of locations]). The table is around ~15 million rows.

SELECT 'us_state' AS type, us_state AS type_id, product, COUNT(*) AS count, MIN(depletions) AS min, MAX(depletions) AS max, SUM(depletions) AS sum, MIN(start_date) AS start_date, MAX(end_date) AS end_date, 
  percentile_cont(0.2) WITHIN GROUP (ORDER BY depletions) AS p20,
  percentile_cont(0.4) WITHIN GROUP (ORDER BY depletions) AS p40,
  percentile_cont(0.5) WITHIN GROUP (ORDER BY depletions) AS p50,
  percentile_cont(0.6) WITHIN GROUP (ORDER BY depletions) AS p60,
  percentile_cont(0.8) WITHIN GROUP (ORDER BY depletions) AS p80 
FROM "venue_product_depletions_locations" 
WHERE "account_id" = 6352 AND "dma" IN ('862','716','532','551','521','622','752','575','604','881','576','555','801','825','569','641','506','588','531','619','855','636','790','623','523','567','512','519','807','819','581','508','600','751','839','691','518','658','693','630','510','635','789','669','661','709','770','673','705','736','500','644','544','866','507','821','520','556','686','592'',89','522','560','557','676','573','529','515','516','570','602','679','634','540','541','678','534','828','671','613','649','771','640','757','659','550','633','698','528','504','804','625','548','559','637','535','618','527','611','517','617','670','561','571','813'',65','765','509','514','566','753','538','650','820','662','546','539','803','563','582','810','501','616','773','758','609','524','530','584','702','533','656','505','511','547') AND (us_state IS NOT NULL) 
GROUP BY product, us_state

query plan image

I've got this really expensive sort in the query plan that I can't figure out how to optimize. I've tried indexes on:

account_id & dma
product & us_state
product, us_state & depletions

Currently the query takes ~21 seconds. If I remove the percentile_cont selects it completes in ~7.5 seconds. I'd like to get it as fast as possible.

Here's the query plan as well (updated to EXPLAIN (ANALYZE, BUFFERS))

GroupAggregate  (cost=1110518.81..1330560.79 rows=44443 width=137) (actual time=17829.623..26151.760 rows=5799 loops=1)
  Group Key: product, us_state
  Buffers: shared hit=3279444, temp read=44945 written=44950
  ->  Sort  (cost=1110518.81..1126164.67 rows=6258343 width=41) (actual time=17829.559..21253.068 rows=6220661 loops=1)
        Sort Key: product, us_state
        Sort Method: external merge  Disk: 359560kB
        Buffers: shared hit=3279444, temp read=44945 written=44950
        ->  Index Scan using account_id_dma_index on venue_product_depletions_locations  (cost=0.43..291274.59 rows=6258343 width=41) (actual time=0.207..4629.167 rows=6220661 loops=1)
              Index Cond: ((account_id = 6352) AND ((dma)::text = ANY ('{862,716,532,551,521,622,752,575,604,881,576,555,801,825,569,641,506,588,531,619,855,636,790,623,523,567,512,519,807,819,581,508,600,751,839,691,518,658,693,630,510,635,789,669,661,709,770,673,705,736,500,644,544,866,507,821,520,556,686,"592'',89",522,560,557,676,573,529,515,516,570,602,679,634,540,541,678,534,828,671,613,649,771,640,757,659,550,633,698,528,504,804,625,548,559,637,535,618,527,611,517,617,670,561,571,"813'',65",765,509,514,566,753,538,650,820,662,546,539,803,563,582,810,501,616,773,758,609,524,530,584,702,533,656,505,511,547}'::text[])))
              Filter: (us_state IS NOT NULL)
              Rows Removed by Filter: 10187
              Buffers: shared hit=3279444
Planning Time: 0.870 ms
Execution Time: 26200.913 ms

CodePudding user response：

Are you sure that accont_id is a string ??? I will supprose it's a INT...

Turn the contents of your IN clause into an indexed temporary table and use a join :

CREATE TEMPORAY TABLE TT_DMA (DMA_ID INT PRIMARY KEY);

INSERT INTO TT_DMA VALUES
(862), (716), (532), (551), (521), (622), (752), (575), (604), (881), (576), 
(555), (801), (825), (569), (641), (506), (588), (531), (619), (855), (636), 
(790), (623), (523), (567), (512), (519), (807), (819), (581), (508), (600), 
(751), (839), (691), (518), (658), (693), (630), (510), (635), (789), (669), 
(661), (709), (770), (673), (705), (736), (500), (644), (544), (866), (507), 
(821), (520), (556), (686), (592), (89), (522), (560), (557), (676), (573),
(529), (515), (516), (570), (602), (679), (634), (540), (541), (678), (534),
(828), (671), (613), (649), (771), (640), (757), (659), (550), (633), (698),
(528), (504), (804), (625), (548), (559), (637), (535), (618), (527), (611),
(517), (617), (670), (561), (571), (813), (65), (765), (509), (514), (566), 
(753), (538), (650), (820), (662), (546), (539), (803), (563), (582), (810), 
(501), (616), (773), (758), (609), (524), (530), (584), (702), (533), (656), 
(505), (511), (547);


SELECT us_state AS type, us_state AS type_id, product, COUNT(*) AS count, 
       MIN(depletions) AS min, MAX(depletions) AS max, SUM(depletions) AS sum, 
       MIN(start_date) AS start_date, MAX(end_date) AS end_date, 
       percentile_cont(0.2) WITHIN GROUP (ORDER BY depletions) AS p20,
       percentile_cont(0.4) WITHIN GROUP (ORDER BY depletions) AS p40,
       percentile_cont(0.5) WITHIN GROUP (ORDER BY depletions) AS p50,
       percentile_cont(0.6) WITHIN GROUP (ORDER BY depletions) AS p60,
       percentile_cont(0.8) WITHIN GROUP (ORDER BY depletions) AS p80 
FROM   "venue_product_depletions_locations" 
       JOIN TT_DMA ON dma = DMA_ID
WHERE  "account_id" = 6352 
       AND "dma" 
       AND (us_state IS NOT NULL) 
GROUP  BY product, us_state;

CodePudding user response：

When you don't use the percentile functions, it probably uses a HashAggregate rather than a GroupAggregate. But HashAggregate is not an option with the percentile functions, as they need to see all the within-group data at the same time.

You can avoid the sort by making an index which can provide the data already in order such as (account_id, product, us_state). But that is likely to need to jump all around in the table to get the rows fetched in the index order, so to really get a big improvement you might need to add the rest of the columns used by this query to the end of the index so you can get an index-only scan.