I am trying to query a table with approximately 1.5 million records. I have indexes and it performs well.
However, one of the columns I want to get a COUNT of a distinct column (that have many duplicates). When I do DISTINCT vs not its 10x slower.
This is the query:
SELECT
created_at,
SUM(amount) as total,
COUNT(DISTINCT partner_id) as count_partners
FROM
consumption
WHERE
is_official = true
AND
(is_processed = true OR is_deferred = true)
GROUP BY created_at
This takes 2.5s
If I make it:
COUNT(partner_id) as count_partners
It takes 230ms. But this is not what I want.
I want a unique set of partners per grouping (date) as well as a sum of the amounts they have consumed in that period.
I don't understand why this is so much slower. PostgreSQL seems to be very quick creating an array of all the duplicates, why does simply adding DISTINCT to it trash its performance?
Query Plan:
GroupAggregate (cost=85780.70..91461.63 rows=12 width=24) (actual time=1019.428..2641.434 rows=13 loops=1)
Output: created_at, sum(amount), count(DISTINCT partner_id)"
Group Key: p.created_at
Buffers: shared hit=16487
-> Sort (cost=85780.70..87200.90 rows=568081 width=16) (actual time=865.599..945.674 rows=568318 loops=1)
Output: created_at, amount, partner_id
Sort Key: p.created_at
Sort Method: quicksort Memory: 62799kB
Buffers: shared hit=16487
-> Seq Scan on public.consumption p (cost=0.00..31484.26 rows=568081 width=16) (actual time=0.020..272.126 rows=568318 loops=1)
Output: created_at, amount, partner_id
Filter: (p.is_official AND (p.is_deferred OR p.is_processed))
Rows Removed by Filter: 931408
Buffers: shared hit=16487
Planning Time: 0.191 ms
Execution Time: 2647.629 ms
Indexes:
CREATE INDEX IF NOT EXISTS i_pid ON consumption (partner_id);
CREATE INDEX IF NOT EXISTS i_processed ON consumption (is_processed);
CREATE INDEX IF NOT EXISTS i_official ON consumption (is_official);
CREATE INDEX IF NOT EXISTS i_deferred ON consumption (is_deferred);
CREATE INDEX IF NOT EXISTS i_created ON consumption (created_at);
CodePudding user response:
The following query should be able to benefit from the indexes.
SELECT
created_at,
SUM(amount) AS total,
COUNT(DISTINCT partner_id) AS count_partners
FROM
(SELECT
created_at,
sum(amount) as amount,
partner_id
FROM consumption
WHERE is_official = true
AND (is_processed = true OR is_deferred = true)
GROUP BY
created_at,
partner_id
) AS c
GROUP BY created_at;