I'm trying to select the orders that are part of a trip with multiple orders.
I tried many approaches but can't find how to have a performant query.
To reproduce the problem here is the setup (here it's 100 000 rows, but really it's more 1 000 000 rows to see the timeout on db-fiddle).
Schema (PostgreSQL v14)
create table trips (id bigint primary key);
create table orders (id bigint primary key, trip_id bigint);
create index trips_idx on trips (id);
create index orders_idx on orders (id);
create index orders_trip_idx on orders (trip_id);
insert into trips (id) select seq from generate_series(1,100000) seq;
insert into orders (id, trip_id) select seq, floor(random() * 100000 1) from generate_series(1,100000) seq;
Query #1
explain analyze select orders.id
from orders
inner join trips on trips.id = orders.trip_id
inner join orders trips_orders on trips_orders.trip_id = trips.id
group by orders.id, trips.id
having count(trips_orders) > 1
limit 50
;
CodePudding user response:
Do you actually need the join on trips? You could try
SELECT shared.id
FROM orders shared
WHERE EXISTS (SELECT * FROM orders other
WHERE other.trip_id = shared.trip_id
AND other.id != shared.id
)
;
to replace the group by with a hash join, or
SELECT unnest(array_agg(orders.id))
FROM orders
GROUP BY trip_id
HAVING count(*) > 1
;
to hopefully get Postgres to just use the trip_id index.