PostgreSQL Nested Loop Join Performance


I have two tables, exchange_rate (100 thousand rows) and paid_date_t (9 million rows), with the structure shown below.

                           "exchange_rate"
           Column            |           Type           | Collation | Nullable | Default 
-----------------------------+--------------------------+-----------+----------+---------
 valid_from                  | timestamp with time zone |           |          | 
 valid_until                 | timestamp with time zone |           |          | 
 currency                    | text                     |           |          | 
Indexes:
    "exchange_rate_unique_valid_from_currency_key" UNIQUE, btree (valid_from, currency)
    "exchange_rate_valid_from_gist_idx" gist (valid_from)
    "exchange_rate_valid_from_until_currency_gist_idx" gist (valid_from, valid_until, currency)
    "exchange_rate_valid_from_until_gist_idx" gist (valid_from, valid_until)
    "exchange_rate_valid_until_gist_idx" gist (valid_until) 

                       "paid_date_t"
      Column       |            Type             | Collation | Nullable | Default 
-------------------+-----------------------------+-----------+----------+---------
 currency          | character varying(3)        |           |          | 
 paid_date         | timestamp without time zone |           |          | 

Indexes:
    "paid_date_t_paid_date_idx" btree (paid_date)

I am running the select query below, joining these tables on multiple join keys:

SELECT
 paid_date
FROM     exchange_rate erd
       JOIN paid_date_t sspd
             ON sspd.paid_date >= erd.valid_from AND sspd.paid_date < erd.valid_until
                 AND erd.currency = sspd.currency
WHERE sspd.currency != 'USD'

However, the query is inefficient and takes hours to execute. The query plan below shows that it is using a nested loop join.

 Nested Loop  (cost=0.28..44498192.71 rows=701389198 width=40)
   ->  Seq Scan on paid_date_t sspd  (cost=0.00..183612.84 rows=2557615 width=24)
         Filter: ((currency)::text <> 'USD'::text)
   ->  Index Scan using exchange_rate_valid_from_until_currency_gist_idx on exchange_rate erd  (cost=0.28..16.53 rows=80 width=36)
         Index Cond: (currency = (sspd.currency)::text)
         Filter: ((sspd.paid_date >= valid_from) AND (sspd.paid_date < valid_until))

I have tried different indexing methods but got the same result. I know that the <= and >= operators do not support merge or hash joins.

Any ideas are appreciated.

CodePudding user response:

You should create a smaller table with just a sample of the rows from paid_date_t in it. It is hard to optimize a query if it takes a very long time each time you try to test it.
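One quick way to do that, sketched here with TABLESAMPLE and a placeholder table name (paid_date_sample), is:

create table paid_date_sample as
select * from paid_date_t tablesample system (1);  -- roughly 1% of the rows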

Your btree index has the column tested for equality (currency) as its second column, which is certainly less efficient. The better btree index for this query (as it is currently written) would be something like (currency, valid_from, valid_until).
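If you want to try that, the statement would be along these lines (just a sketch; PostgreSQL will pick a name for the index automatically):

create index on exchange_rate (currency, valid_from, valid_until);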

For a gist index, you really want it to be on the time range itself, not on the separate endpoints of the range. You could either convert the table to hold a range type, or build a functional index that converts them on the fly (and then rewrite the query to use the same expression). This is complicated by the fact that your tables have different types due to the different handling of time zones. The index would look like:

create index on exchange_rate using gist (tstzrange(valid_from,valid_until), currency);

and then the ON condition would look like:

ON sspd.paid_date::timestamptz <@ tstzrange(erd.valid_from, erd.valid_until)
             AND erd.currency = sspd.currency
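One thing to watch: a multicolumn GiST index that mixes a range expression with a plain text column such as currency relies on the btree_gist extension to provide a GiST operator class for text, so you will likely need to install it before creating the index:

create extension if not exists btree_gist;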

It might be faster to have the order of the columns in the gist index reversed from what I show; you should try it both ways on your own data and see.
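For reference, the reversed version would be something like (again, only a sketch to benchmark, not a recommendation):

create index on exchange_rate using gist (currency, tstzrange(valid_from, valid_until));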
