Understanding SQL EXPLAIN on a JOIN query-CodePudding

I'm struggling to make sense of postgres EXPLAIN to figure out why my query is slow. Can someone help? This is my query, it's a pretty simple join:

SELECT DISTINCT graph_concepts.* 
FROM graph_concepts 
INNER JOIN graph_family_links 
        ON graph_concepts.id = graph_family_links.descendent_concept_id 
WHERE graph_family_links.ancestor_concept_id = 1016 
AND graph_family_links.generation = 1 
AND graph_concepts.state != 2

It's starting from a concept and it's getting a bunch of related concepts through the links table.

Notably, I have an index on graph_family_links.descendent_concept_id, yet this query takes about 3 seconds to return a result. This is way too long for my purposes.

This is the SQL explain:

---------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
 Unique  (cost=46347.01..46846.16 rows=4485 width=108)
   ->  Gather Merge  (cost=46347.01..46825.98 rows=4485 width=108)
         Workers Planned: 1
         ->  Sort  (cost=45347.01..45348.32 rows=2638 width=108)
               Sort Key: graph_concepts.id, graph_concepts.definition, graph_concepts.checkvist_task_id, graph_concepts.primary_question_id, graph_concepts.created_at, graph_concepts.updated_at, graph_concepts.tsn, graph_concepts.state, graph_concepts.search_phrases
               ->  Nested Loop  (cost=301.97..45317.02 rows=2638 width=108)
                     ->  Parallel Bitmap Heap Scan on graph_family_links  (cost=301.88..39380.60 rows=2640 width=4)
                           Recheck Cond: (ancestor_concept_id = 1016)
                           Filter: (generation = 1)
                           ->  Bitmap Index Scan on index_graph_family_links_on_ancestor_concept_id  (cost=0.00..301.66 rows=38382 width=0)
                                 Index Cond: (ancestor_concept_id = 1016)
                     ->  Index Scan using graph_concepts_pkey on graph_concepts  (cost=0.08..2.25 rows=1 width=108)
                           Index Cond: (id = graph_family_links.descendent_concept_id)
                           Filter: (state <> 2)

I'm doing lots of googling to help me figure out how to read this EXPLAIN and I'm struggling. Can someone help translate this into plain english for me?

CodePudding user response：

SELECT DISTINCT
    graph_concepts.* 
FROM
    graph_concepts
    INNER JOIN graph_family_links ON graph_concepts.id = graph_family_links.descendent_concept_id 
WHERE
    graph_family_links.ancestor_concept_id = 1016 
    AND graph_family_links.generation = 1 
    AND graph_concepts.state != 2

---------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
 Unique  (cost=46347.01..46846.16 rows=4485 width=108)
   ## (Merge records DISTINCT)
   ->  Gather Merge  (cost=46347.01..46825.98 rows=4485 width=108)
         Workers Planned: 1
                 
        ## (Sort table graph_concepts.* )
         ->  Sort  (cost=45347.01..45348.32 rows=2638 width=108)
               Sort Key: graph_concepts.id, graph_concepts.definition, graph_concepts.checkvist_task_id, graph_concepts.primary_question_id, graph_concepts.created_at, graph_concepts.updated_at, graph_concepts.tsn, graph_concepts.state, graph_concepts.search_phrases
                             
                             
               ->  Nested Loop  (cost=301.97..45317.02 rows=2638 width=108)
                     ## WHERE graph_family_links.ancestor_concept_id = 1016     (Use Parallel Bitmap Heap Scan table and filter record)
                     ->  Parallel Bitmap Heap Scan on graph_family_links  (cost=301.88..39380.60 rows=2640 width=4)
                           Recheck Cond: (ancestor_concept_id = 1016)
                           Filter: (generation = 1)
                                                     
                            ## AND graph_family_links.generation = 1 (Use Bitmap Index Scan table and filter record)
                           ->  Bitmap Index Scan on index_graph_family_links_on_ancestor_concept_id  (cost=0.00..301.66 rows=38382 width=0)
                                 Index Cond: (ancestor_concept_id = 1016)
                     
                                         
                                ## AND graph_concepts.state != 2 (Use Index Scan table and filter record)
                                ->  Index Scan using graph_concepts_pkey on graph_concepts  (cost=0.08..2.25 rows=1 width=108)
                                Index Cond: (id = graph_family_links.descendent_concept_id)
                                Filter: (state <> 2)
---------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------

Please refer to the below sql script.

SELECT DISTINCT graph_concepts.* 
FROM graph_concepts
    INNER JOIN (select descendent_concept_id from graph_family_links where ancestor_concept_id = 1016 and generation = 1) A ON graph_concepts.id = A.descendent_concept_id 
WHERE graph_concepts.state != 2

CodePudding user response：

Explain read from bottom to up, generally.
https://www.postgresql.org/docs/current/sql-explain.html

This command displays the execution plan that the PostgreSQL planner generates for the supplied statement. The execution plan shows how the table(s) referenced by the statement will be scanned — by plain sequential scan, index scan, etc. — and if multiple tables are referenced, what join algorithms will be used to bring together the required rows from each input table.

Since you only do explain your select command meaning the output is the system planner generate a execute plan for this select query. But if you do explain analyze then it will plan and execute the query.

First there is two table there. For each table use some ways to found out which rows meet the where criteria. index scan (one of the way to find out where is the row) found out in table graph_concepts which row meet the condition: graph_concepts.state != 2. Also in the mean time(Parallel) use Bitmap Heap Scan to found out in table graph_family_links which row meet the criteria: graph_family_links.ancestor_concept_id = 1016
After that then do join operation. In this case, it's Nested Loop.
After join then do Sort. Why we need to sort operation? Because you specified: SELECT DISTINCT : https://www.postgresql.org/docs/current/sql-select.html -After sort then since you specified key word DISTINCT then eliminate the duplicates.

CodePudding user response：

Here I suggest looking at relative "cost" which isn't too hard for that explain plan. For a total cost of ~46347.01 most of that (~45347.01 or ~97%) is taken-up by the sort operation to facilitate getting distinct rows.

 Unique  (cost=46347.01..46846.16 rows=4485 width=108)
   ## (Merge records DISTINCT)
   ->  Gather Merge  (cost=46347.01..46825.98 rows=4485 width=108)
         Workers Planned: 1

        ## (Sort table graph_concepts.* )
         ->  Sort  (cost=45347.01..45348.32 rows=2638 width=108)
               Sort Key: graph_concepts.id, graph_concepts.definition, graph_concepts.checkvist_task_id
                       , graph_concepts.primary_question_id, graph_concepts.created_at, graph_concepts.updated_at
                       , graph_concepts.tsn, graph_concepts.state, graph_concepts.search_phrases

With the assumption that you have a unique constraint (graph_concepts.id) the only way to be getting duplicated rows would be through the join - so address that join so that no row multiplication occurs which can be done like the following

SELECT graph_concepts.*
FROM graph_concepts
INNER JOIN (
    SELECT DISTINCT descendent_concept_id
    FROM graph_family_links
    WHERE graph_family_links.ancestor_concept_id = 1016
        AND graph_family_links.generation = 1
    ) AS d ON graph_concepts.id = d.descendent_concept_id
WHERE graph_concepts.STATE != 2

Here only 1 row per graph_concepts.id is returned in the subquery, with that subquery also containing the where clause as it pertains to the graph_family_links table. Thus, this subquery cannot multiply the rows of table graph_concepts and hence select distinct on that table is no longer necessary.

Note also, that the subquery is only returning a single column. In general keeping the number of columns small aids the performance of select distinct.

CodePudding user response：

Ok. Let us change one method.

A nested loop join works like this:

PostgreSQL scans the outer table, in this case b.
For each row found in the outer table, PostgreSQL scans the inner table, in this case a, for matching rows.

Since there is an index on the join condition on the inner table, PostgreSQL uses an index scan there.

So with a nested loop join, only an index on the join condition on the inner table can be used.

There is also a condition on the scan of the outer table (generation = 1), but that has nothing to do with the join. You don't seem to have an index on that column, so a sequential scan is used.

The large number of executions for the inner loop is explained by the 2638 rows found scanning the outer table: for each of these rows, the inner table gets scanned.

Almost all the execution time is spent on that parallel sequential scan, and most of the rows are discarded, so I assume that the following index will make the query considerably faster:


CREATE INDEX ON graph_family_links (generation);