I've been given the task of optimizing the following query (not written by me):
SELECT
"u"."email" as email,
r.url as "domain",
"u"."id" as "requesterId",
s.total * 100 / count("r"."id") as "rate",
count(("r"."url", "u"."email", "u"."id", s."total")) OVER () as total
FROM
(
SELECT
url,
id,
"requesterId",
created_at
FROM
(
SELECT
url,
id,
"requesterId",
created_at,
row_number() over (partition by main_request_uuid) as row_number
FROM
"requests" "request"
GROUP BY
main_request_uuid,
retry_number,
url,
id,
"requesterId",
created_at
ORDER BY
main_request_uuid ASC,
retry_number DESC
) "request_"
WHERE
request_.row_number = 1
) "r"
INNER JOIN (
SELECT
"requesterId",
url,
count(created_at) AS "total"
FROM
(
SELECT
url,
status,
created_at,
"requesterId"
FROM
(
SELECT
url,
status,
created_at,
"requesterId",
row_number() over (partition by main_request_uuid) as row_number
FROM
"requests" "request"
GROUP BY
main_request_uuid,
retry_number,
url,
status,
created_at,
"requesterId"
ORDER BY
main_request_uuid ASC,
retry_number DESC
) "request_"
WHERE
request_.row_number = 1
) "s"
WHERE
status IN ('success')
AND s."created_at" :: date >= '2022-01-07' :: date
AND s."created_at" :: date <= '2022-02-07' :: date
GROUP BY
s.url,
s."requesterId"
) "s" ON s."requesterId" = r."requesterId"
AND s.url = r.url
INNER JOIN "users" "u" ON "u"."id" = r."requesterId"
WHERE
r."created_at" :: date >= '2022-01-07' :: date
AND r."created_at" :: date <= '2022-02-07' :: date
GROUP BY
r.url,
"u"."email",
"u"."id",
s.total
LIMIT
10
So there is the requests table, which stores API requests. There is a mechanism to retry a request if it fails, up to 5 times, keeping a separate row for each retry. If it still fails after 5 retries, it is not retried any further. This is the reason for the partition by subquery, which selects only the main requests.
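(For reference, the same "one row per main_request_uuid" selection can also be written with DISTINCT ON in PostgreSQL; this is only an equivalent sketch of the intent, assuming the wanted row is the one with the highest retry_number:)
SELECT DISTINCT ON (main_request_uuid)
       url, id, status, "requesterId", created_at
FROM requests
ORDER BY main_request_uuid, retry_number DESC;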
What the query should return is the total number of requests and the success rate, grouped by url and requesterId. The query I was given is not only wrong, but also takes a huge amount of time to execute, so I came up with the optimized version below:
WITH a AS (SELECT url,
id,
status,
"requesterId",
created_at
FROM (
SELECT url,
id,
status,
"requesterId",
created_at,
row_number() over (partition by main_request_uuid) as row_number
FROM "requests" "request"
WHERE
created_at:: date >= '2022-01-07' :: date
AND created_at :: date <= '2022-02-07' :: date
GROUP BY main_request_uuid,
retry_number,
url,
id,
status,
"requesterId",
created_at
ORDER BY
main_request_uuid ASC,
retry_number DESC
) "request_"
WHERE request_.row_number = 1),
b AS (SELECT count(*) total, a2.url as url, a2."requesterId" FROM a a2 GROUP BY a2.url, a2."requesterId"),
c AS (SELECT count(*) success, a3.url as url, a3."requesterId"
FROM a a3
WHERE status IN ('success')
GROUP BY a3.url, a3."requesterId")
SELECT success * 100 / total as rate, b.url, b."requesterId", total, email
FROM b
JOIN c ON b.url = c.url AND b."requesterId" = c."requesterId" JOIN users u ON b."requesterId" = u.id
LIMIT 10;
What the new version basically does is select all the main requests, then count the successful ones and the total. The new version still takes a lot of time to execute (around 60s on a table with 4 million requests).
Is there a way to optimize this further?
You can see the table structure below. The table has no relevant indexes, but adding one on (url, requesterId) had no effect.
column_name | data_type |
---|---|
id | bigint |
requesterId | bigint |
proxyId | bigint |
url | character varying |
status | USER-DEFINED |
time_spent | integer |
created_at | timestamp with time zone |
request_information | jsonb |
retry_number | smallint |
main_request_uuid | character varying |
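For reference, the composite index I tried was along these lines (a sketch; the exact statement and column order used may have differed):
CREATE INDEX ON requests (url, "requesterId");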
And here is the execution plan, run on a backup table with 100k rows. It takes 1.1s for 100k rows, but ideally it would be cut down to around 200ms for this case.
Limit (cost=15196.40..15204.56 rows=1 width=77) (actual time=749.664..1095.476 rows=10 loops=1)
CTE a
-> Subquery Scan on request_ (cost=15107.66..15195.96 rows=3 width=159) (actual time=226.805..591.188 rows=49474 loops=1)
Filter: (request_.row_number = 1)
Rows Removed by Filter: 70962
-> WindowAgg (cost=15107.66..15188.44 rows=602 width=206) (actual time=226.802..571.185 rows=120436 loops=1)
-> Group (cost=15107.66..15179.41 rows=602 width=198) (actual time=226.797..435.340 rows=120436 loops=1)
" Group Key: request.main_request_uuid, request.retry_number, request.url, request.id, request.status, request.""requesterId"", request.created_at"
-> Gather Merge (cost=15107.66..15170.62 rows=502 width=198) (actual time=226.795..386.198 rows=120436 loops=1)
Workers Planned: 2
Workers Launched: 2
-> Group (cost=14107.64..14112.66 rows=251 width=198) (actual time=212.749..269.504 rows=40145 loops=3)
" Group Key: request.main_request_uuid, request.retry_number, request.url, request.id, request.status, request.""requesterId"", request.created_at"
-> Sort (cost=14107.64..14108.27 rows=251 width=198) (actual time=212.744..250.031 rows=40145 loops=3)
" Sort Key: request.main_request_uuid, request.retry_number DESC, request.url, request.id, request.status, request.""requesterId"", request.created_at"
Sort Method: external merge Disk: 7952kB
Worker 0: Sort Method: external merge Disk: 8568kB
Worker 1: Sort Method: external merge Disk: 9072kB
-> Parallel Seq Scan on requests request (cost=0.00..14097.63 rows=251 width=198) (actual time=0.024..44.013 rows=40145 loops=3)
Filter: (((created_at)::date >= '2022-01-07'::date) AND ((created_at)::date <= '2022-02-07'::date))
-> Nested Loop (cost=0.43..8.59 rows=1 width=77) (actual time=749.662..1095.364 rows=10 loops=1)
" Join Filter: (a2.""requesterId"" = u.id)"
-> Nested Loop (cost=0.16..0.28 rows=1 width=64) (actual time=749.630..1095.163 rows=10 loops=1)
" Join Filter: (((a2.url)::text = (a3.url)::text) AND (a2.""requesterId"" = a3.""requesterId""))"
Rows Removed by Join Filter: 69
-> HashAggregate (cost=0.08..0.09 rows=1 width=48) (actual time=703.128..703.139 rows=10 loops=1)
" Group Key: a3.url, a3.""requesterId"""
Batches: 5 Memory Usage: 4297kB Disk Usage: 7040kB
-> CTE Scan on a a3 (cost=0.00..0.07 rows=1 width=40) (actual time=226.808..648.251 rows=41278 loops=1)
Filter: (status = 'success'::requests_status_enum)
Rows Removed by Filter: 8196
-> HashAggregate (cost=0.08..0.11 rows=3 width=48) (actual time=38.103..38.105 rows=8 loops=10)
" Group Key: a2.url, a2.""requesterId"""
Batches: 41 Memory Usage: 4297kB Disk Usage: 7328kB
-> CTE Scan on a a2 (cost=0.00..0.06 rows=3 width=40) (actual time=0.005..7.419 rows=49474 loops=10)
" -> Index Scan using ""PK_a3ffb1c0c8416b9fc6f907b7433"" on users u (cost=0.28..8.29 rows=1 width=29) (actual time=0.015..0.015 rows=1 loops=10)"
" Index Cond: (id = a3.""requesterId"")"
Planning Time: 1.494 ms
Execution Time: 1102.488 ms
CodePudding user response:
These lines of your plan point to a possible optimization.
-> Parallel Seq Scan on requests request (cost=0.00..14097.63 rows=251 width=198) (actual time=0.024..44.013 rows=40145 loops=3)
Filter: (((created_at)::date >= '2022-01-07'::date) AND ((created_at)::date <= '2022-02-07'::date))
Sequential scans, parallel or not, are somewhat costly.
So, try changing these WHERE conditions to make them sargable and useful for a range scan.
created_at:: date >= '2022-01-07' :: date
AND created_at :: date <= '2022-02-07' :: date
Change these to
created_at >= '2022-01-07' :: date
AND created_at < '2022-02-07' :: date + INTERVAL '1' DAY
And, put a BTREE index on the created_at column.
CREATE INDEX ON requests (created_at);
Your query is complex, so I'm not totally sure this will work. But try it. The index should pull out only the rows for the dates you need.
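One way to verify whether the index gets used is to run the plan for just the rewritten range predicate and check that the Parallel Seq Scan line turns into an Index Scan or Bitmap Heap Scan on the new index (a sketch; column list abbreviated):
EXPLAIN (ANALYZE, BUFFERS)
SELECT url, status, "requesterId", created_at
FROM requests
WHERE created_at >= '2022-01-07' :: date
  AND created_at < '2022-02-07' :: date + INTERVAL '1' DAY;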
And, your LIMIT clause without an accompanying ORDER BY clause gives PostgreSQL permission to give you back whatever ten rows it wants from the result set. Don't use LIMIT without ORDER BY. Don't use LIMIT at all unless you need it.
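If the ten rows do need to be deterministic, pin them down with an explicit sort, for example on the b/c CTEs from the question (the sort key here is only an assumption about what the "top 10" should mean):
SELECT success * 100 / total AS rate, b.url, b."requesterId", total, email
FROM b
JOIN c ON b.url = c.url AND b."requesterId" = c."requesterId"
JOIN users u ON b."requesterId" = u.id
ORDER BY total DESC, b.url, b."requesterId"
LIMIT 10;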
CodePudding user response:
Writing queries efficiently is a major part of query optimization, especially when handling huge data sets. When working with a large volume of data, avoid unnecessary GROUP BY or ORDER BY clauses, explicit type casting, too many joins, extra subqueries, and LIMIT without ORDER BY (where possible), as long as the result still meets the requirement. Create an index on the created_at column. If a LEFT JOIN were used in your query, the query pattern would change. My observations are:
-- Avoid the unnecessary GROUP BY (no aggregate function is used) and ORDER BY
SELECT url
, id
, status
, "requesterId"
, created_at
FROM (SELECT url
, id
, status
, "requesterId"
, created_at
, row_number() over (partition by main_request_uuid order by retry_number DESC) as row_number
FROM "requests" "request"
WHERE (created_at:: date >= '2022-01-07'
AND created_at :: date <= '2022-02-07')) "request_"
WHERE request_.row_number = 1
N.B.: if the created_at column's data type is timestamp without time zone, then the extra casting is not needed. Use the statement below instead:
(created_at >= '2022-01-07'
AND created_at <= '2022-02-07')
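Note that on a timestamp column, created_at <= '2022-02-07' compares against midnight, so rows from later in the day on 2022-02-07 would be excluded. If the end date is meant to be inclusive (as the original ::date comparison was), a half-open range keeps that behaviour while staying sargable (a sketch):
created_at >= '2022-01-07'
AND created_at < '2022-02-07' :: date + INTERVAL '1' DAY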
-- Combine the two CTEs into one, as per the requirement
SELECT url, "requesterId", COUNT(1) total
, COUNT(1) FILTER (WHERE status = 'success') success
FROM a
-- WHERE status IN ('success')
GROUP BY url, "requesterId"
So the final query looks like below:
WITH a AS (
SELECT url
, id
, status
, "requesterId"
, created_at
FROM (SELECT url
, id
, status
, "requesterId"
, created_at
, row_number() over (partition by main_request_uuid order by retry_number DESC) as row_number
FROM "requests" "request"
WHERE (created_at:: date >= '2022-01-07'
AND created_at :: date <= '2022-02-07')) "request_"
WHERE request_.row_number = 1
), b as (
SELECT url, "requesterId", COUNT(1) total
, COUNT(1) FILTER (WHERE status = 'success') success
FROM a
-- WHERE status IN ('success')
GROUP BY url, "requesterId"
) select (success * 100) / total as rate
, b.url, b."requesterId", total, email
from b
JOIN users u
ON u.id = b."requesterId"
limit 10;
If the above query doesn't meet the requirement, then try the query below. But this form would only really be preferable if a LEFT JOIN were used instead of the INNER JOIN:
WITH a AS (
SELECT url
, id
, status
, "requesterId"
, created_at
FROM (SELECT url
, id
, status
, "requesterId"
, created_at
, row_number() over (partition by main_request_uuid order by retry_number DESC) as row_number
FROM "requests" "request"
WHERE (created_at:: date >= '2022-01-07'
AND created_at :: date <= '2022-02-07')) "request_"
WHERE request_.row_number = 1
), b as (
SELECT count(1) total, a2.url as url, a2."requesterId"
FROM a a2
GROUP BY a2.url, a2."requesterId"
), c AS (SELECT count(1) success, a3.url as url, a3."requesterId"
FROM a a3
WHERE status = 'success'
GROUP BY a3.url, a3."requesterId")
SELECT success * 100 / total as rate, b.url, b."requesterId", total, email
FROM b
JOIN c
ON b.url = c.url AND b."requesterId" = c."requesterId"
JOIN users u
ON b."requesterId" = u.id
LIMIT 10;