Home > Software engineering >  Fast PostgreSQL query with GROUP BY, COUNT(DISTINCT) and SUM on differrent columns
Fast PostgreSQL query with GROUP BY, COUNT(DISTINCT) and SUM on differrent columns

Time:05-02

I am trying to query a table with approximately 1.5 million records. I have indexes and it performs well.

However, one of the columns I want to get a COUNT of a distinct column (that have many duplicates). When I do DISTINCT vs not its 10x slower.

This is the query:

SELECT
    created_at,
    SUM(amount) as total,
    COUNT(DISTINCT partner_id) as count_partners
FROM 
    consumption
WHERE
    is_official = true
    AND
    (is_processed = true OR is_deferred = true)
GROUP BY created_at

This takes 2.5s

If I make it:

COUNT(partner_id) as count_partners

It takes 230ms. But this is not what I want.

I want a unique set of partners per grouping (date) as well as a sum of the amounts they have consumed in that period.

I don't understand why this is so much slower. PostgreSQL seems to be very quick creating an array of all the duplicates, why does simply adding DISTINCT to it trash its performance?

Query Plan:

GroupAggregate  (cost=85780.70..91461.63 rows=12 width=24) (actual time=1019.428..2641.434 rows=13 loops=1)
  Output: created_at, sum(amount), count(DISTINCT partner_id)" 
Group Key: p.created_at
  Buffers: shared hit=16487
  ->  Sort  (cost=85780.70..87200.90 rows=568081 width=16) (actual time=865.599..945.674 rows=568318 loops=1)
        Output: created_at, amount, partner_id
        Sort Key: p.created_at
        Sort Method: quicksort  Memory: 62799kB
        Buffers: shared hit=16487
        ->  Seq Scan on public.consumption p  (cost=0.00..31484.26 rows=568081 width=16) (actual time=0.020..272.126 rows=568318 loops=1)
              Output: created_at, amount, partner_id
              Filter: (p.is_official AND (p.is_deferred OR p.is_processed))
              Rows Removed by Filter: 931408
              Buffers: shared hit=16487
Planning Time: 0.191 ms
Execution Time: 2647.629 ms

Indexes:

CREATE INDEX IF NOT EXISTS i_pid ON consumption (partner_id);
CREATE INDEX IF NOT EXISTS i_processed ON consumption (is_processed);
CREATE INDEX IF NOT EXISTS i_official ON consumption (is_official);
CREATE INDEX IF NOT EXISTS i_deferred ON consumption (is_deferred);
CREATE INDEX IF NOT EXISTS i_created ON consumption (created_at);

CodePudding user response:

The following query should be able to benefit from the indexes.

SELECT  
  created_at,     
  SUM(amount) AS total,     
  COUNT(DISTINCT partner_id) AS count_partners 
FROM      
  (SELECT 
    created_at,         
    sum(amount) as amount,         
    partner_id   
  FROM consumption   
  WHERE is_official = true     
    AND (is_processed = true OR is_deferred = true)   
  GROUP BY 
    created_at,
    partner_id   
  ) AS c  
GROUP BY created_at;
  • Related