Home > Software design >  Postgres FDW Remote only JOIN causes fetch all data
Postgres FDW Remote only JOIN causes fetch all data

Time:05-24

I'm trying to fetch data from a remote server with JOIN clause which involves only remote tables, but it is very slow because the planner decides to fetch all data from two tables and merge it locally. When I add WHERE clause it fixes the problem and JOIN is executed fully on a remote server.

Problem is reproducible on small example:

-- remote server
create table test
(
    id   serial
        constraint test_pk
            primary key,
    name text
);
create table test2
(
    test_id int
        constraint test2_test_id_fk
            references test (id),
    info    text
);

SELECT Query:

SELECT "test".id  FROM "test" JOIN "test2" ON "test"."id"="test2".test_id;

Output from EXPLAIN VERBOSE (on empty tables!):

Merge Join  (cost=732.29..1388.59 rows=42778 width=4)
  Output: test.id
  Merge Cond: (test.id = test2.test_id)
  ->  Sort  (cost=366.15..373.46 rows=2925 width=4)
        Output: test.id
        Sort Key: test.id
        ->  Foreign Scan on public.test  (cost=100.00..197.75 rows=2925 width=4)
              Output: test.id
              Remote SQL: SELECT id FROM public.test
  ->  Sort  (cost=366.15..373.46 rows=2925 width=4)
        Output: test2.test_id
        Sort Key: test2.test_id
        ->  Foreign Scan on public.test2  (cost=100.00..197.75 rows=2925 width=4)
              Output: test2.test_id
              Remote SQL: SELECT test_id FROM public.test2

After adding WHERE test.id=1

Foreign Scan  (cost=100.00..198.75 rows=225 width=4)
  Output: test.id
  Relations: (public.test) INNER JOIN (public.test2)
  Remote SQL: SELECT r1.id FROM (public.test r1 INNER JOIN public.test2 r2 ON (((r2.test_id = 1)) AND ((r1.id = 1))))

I'm using AWS RDS Postgres v10.18 on both sides.

What is going on? How to force execution on a remote server? I didn't find anything with that problem.

Thanks for any help.

CodePudding user response:

PostgreSQL estimates that the join result will consist of 42778 rows, so it thinks it more efficient to join the tables locally rather than transferring the big result set.

If that estimate is not correct, ANALYZE both foreign tables to get accurate statistics, then try again. Remember that foreign tables are not analyzed automatically.

In general, when asking performance questions, always include EXPLAIN (ANALYZE, BUFFERS) output.

CodePudding user response:

PostgreSQL has no idea how much data it will find in those tables, and its completely arbitrary guess is not very good.

You can help it out by doing this:

alter server fdw options (add use_remote_estimate 'on');

Planning will take substantially longer because it needs to make multiple roundtrips to the foreign server to do the planning, but IME that is usually well worth it.

You can instead ANALYZE the foreign tables on the local side, so that it stores the stats locally. Planning time should not suffer as much as with use_remote_estimate. You would need to repeat occasionally, as they will not be recomputed automatically. I've had poor experiences with this, but that was several releases ago so maybe it has improved.

Either one fixes your reproducer case for me

  • Related