Interpretation of Postgres Actual Loops in EXPLAIN ANALYZE


When reading a Postgres EXPLAIN ANALYZE plan, the documentation states that under some conditions operators may be executed more than once (as indicated by Actual Loops). In these cases, some performance measurements (e.g. Actual Total Time and Actual Rows) are reported as per-loop averages rather than as totals for the entire operator:

In some query plans, it is possible for a subplan node to be executed more than once. For example, the inner index scan will be executed once per outer row in the above nested-loop plan. In such cases, the loops value reports the total number of executions of the node, and the actual time and rows values shown are averages per-execution. This is done to make the numbers comparable with the way that the cost estimates are shown. Multiply by the loops value to get the total time actually spent in the node. In the above example, we spent a total of 0.220 milliseconds executing the index scans on tenk2.
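For reference, the nested-loop plan that passage refers to comes from the same documentation page; as far as I can tell it is produced by a query roughly of this form, where the inner index scan on tenk2 runs once per qualifying outer row, so its per-loop time has to be multiplied by loops to recover the quoted 0.220 ms:

    EXPLAIN ANALYZE
    SELECT *
    FROM tenk1 t1, tenk2 t2
    WHERE t1.unique1 < 10
      AND t1.unique2 = t2.unique2;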

However, I have found multiple query plans where the Actual Total Time of a sub-node, when multiplied by the number of loops, becomes larger than the total time of the root node. It seems that in these cases the reported time already reflects the actual total time rather than a per-loop average (example below).

So, under which conditions is it necessary to multiply the values of Actual Rows and Actual Total Time by Actual Loops in order to get the correct results, and under which conditions are the values already aggregated correctly?

Example:

The query explain analyze select count(*) from name n join person_info pi on n.id = pi.person_id;

produces the following plan:

    QUERY PLAN
----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
 Finalize Aggregate  (cost=233201.06..233201.07 rows=1 width=8) (actual time=1770.430..1823.586 rows=1 loops=1)
   ->  Gather  (cost=233200.84..233201.05 rows=2 width=8) (actual time=1769.845..1823.580 rows=3 loops=1)
         Workers Planned: 2
         Workers Launched: 2
         ->  Partial Aggregate  (cost=232200.84..232200.85 rows=1 width=8) (actual time=1768.782..1768.813 rows=1 loops=3)
               ->  Parallel Hash Join  (cost=145078.16..227887.97 rows=1725150 width=0) (actual time=1234.256..1611.255 rows=1376736 loops=3)
                     Hash Cond: (pi.person_id = n.id)
                     ->  Parallel Index Only Scan using person_id_person_info on person_info pi  (cost=0.43..78281.72 rows=1725150 width=4) (actual time=0.086..195.010 rows=1376736 loops=3)
                           Heap Fetches: 0
                     ->  Parallel Hash  (cost=111851.77..111851.77 rows=2658077 width=4) (actual time=782.084..782.085 rows=2126580 loops=3)
                           Buckets: 131072  Batches: 128  Memory Usage: 3040kB
                           ->  Parallel Seq Scan on name n  (cost=0.00..111851.77 rows=2658077 width=4) (actual time=0.018..345.728 rows=2126580 loops=3)
 Planning Time: 0.133 ms
 Execution Time: 1823.752 ms
(14 rows)

The final aggregate has a total runtime of 1823.586 ms. However, the last Parallel Hash operation has a runtime of 782.085 ms with 3 loops. According to the documentation, this would mean a runtime of 782.085 ms * 3 = 2346.255 ms, which is larger than the total runtime of the root node. Since a node's timing always includes that of its children, this cannot be correct.

CodePudding user response:

I won't be able to give an exhaustive answer, but the rule of thumb is as follows:

  • the following only applies to nodes below Gather, as those are parallelized

  • row counts, "rows removed by filter", heap fetches and similar counts are divided by the number of processes ("loops"), so you have to multiply them to get the total

  • execution times and buffer counts are not divided

For the times, this is obvious: if three parallel processes work for a second to compute a result, it only takes one second until they are done, not three seconds. That's the whole point of parallel query!
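Applied to the plan above: the join's 1376736 rows per loop times loops=3 gives the roughly 4.1 million joined rows in total, while the 782 ms of the Parallel Hash is already wall-clock time and must not be multiplied. As a rough sketch (using the table names from your question), you can see the other behaviour by disabling parallelism and re-running the query; without a Gather node, any loops value greater than 1 means genuinely repeated executions, and the times really are per-loop averages:

    SET max_parallel_workers_per_gather = 0;   -- force a non-parallel plan
    EXPLAIN (ANALYZE)
    SELECT count(*)
    FROM name n
    JOIN person_info pi ON n.id = pi.person_id;
    RESET max_parallel_workers_per_gather;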
