PostgreSQL query optimization challenge

I am trying to optimize this query:

SELECT eq.*,
    reg_last_dt.dt as reg_last_date
FROM Equipment eq
  INNER JOIN (
    select max( dt ) as dt, id_eq_equipment 
    from consum 
    group by id_eq_equipment
  ) as reg_last_dt ON reg_last_dt.id_eq_equipment = eq.id_eq

Explain shows me this:

Hash Join  (cost=839806.69..839833.33 rows=23 width=1461)
  Hash Cond: (eq.id_eq = consum.id_eq_equipment)
  ->  Seq Scan on equipment eq  (cost=0.00..26.29 rows=129 width=1453)
  ->  Hash  (cost=839806.40..839806.40 rows=23 width=10)
        ->  Finalize GroupAggregate  (cost=839805.60..839806.17 rows=23 width=10)
              Group Key: consum.id_eq_equipment
              ->  Sort  (cost=839805.60..839805.71 rows=46 width=10)
                    Sort Key: consum.id_eq_equipment
                    ->  Gather  (cost=839799.50..839804.33 rows=46 width=10)
                          Workers Planned: 2
                          ->  Partial HashAggregate  (cost=838799.50..838799.73 rows=23 width=10)
                                Group Key: consum.id_eq_equipment
                                ->  Parallel Seq Scan on consum  (cost=0.00..755192.33 rows=16721433 width=10)

This does not look optimal. Is there anything I can do to improve it?

CodePudding user response：

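The row estimates in the query plan (only rows=129 for Equipment and only rows=23 for the aggregated consum) indicate that consum holds few distinct id_eq_equipment values spread over a very large number of rows. In that case, a LATERAL subquery should perform much faster, because each max(dt) can be answered from an index instead of scanning the whole table: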

SELECT eq.*, r.reg_last_date
FROM   Equipment eq
CROSS  JOIN LATERAL (
   SELECT max(dt) AS reg_last_date
   FROM   consum c
   WHERE  c.id_eq_equipment = eq.id_eq
   ) r;

Be sure to have a multicolumn index on consum(id_eq_equipment, dt)!

See: Optimize GROUP BY query to retrieve latest row per user
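A minimal sketch of the index mentioned above. The table and column names come from the question; the index name is made up for illustration, and CONCURRENTLY is an optional choice that avoids blocking writes while a large table is being indexed.

```sql
-- Hypothetical index name; columns are taken from the question.
-- CONCURRENTLY builds the index without blocking writes to consum,
-- but it cannot be run inside a transaction block.
CREATE INDEX CONCURRENTLY consum_id_eq_equipment_dt_idx
    ON consum (id_eq_equipment, dt);
```

With id_eq_equipment as the leading column and dt second, the latest dt for one piece of equipment sits at the end of that equipment's range in the index, so it can be found without reading the table.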

Maybe you really want a LEFT JOIN to return all rows from Equipment? See:

What is the difference between LATERAL JOIN and a subquery in PostgreSQL?
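For illustration, here is a sketch of a LEFT JOIN LATERAL variant. It swaps max(dt) for ORDER BY ... LIMIT 1, which is not what the answer wrote. The distinction matters because an aggregate without GROUP BY always returns exactly one row (NULL when nothing matches), whereas a LIMIT 1 subquery returns no row at all for equipment without consum entries. The sketch assumes dt is never NULL; otherwise ORDER BY c.dt DESC NULLS LAST would be needed.

```sql
-- Sketch, assuming consum.dt is NOT NULL.
-- LEFT JOIN ... ON true keeps Equipment rows that have no match in consum;
-- reg_last_date is NULL for those rows.
SELECT eq.*, r.reg_last_date
FROM   Equipment eq
LEFT   JOIN LATERAL (
   SELECT c.dt AS reg_last_date
   FROM   consum c
   WHERE  c.id_eq_equipment = eq.id_eq
   ORDER  BY c.dt DESC
   LIMIT  1
   ) r ON true;
```

Replacing LEFT JOIN LATERAL ... ON true with CROSS JOIN LATERAL in this variant would drop equipment without consum rows, which is closer to the INNER JOIN in the original query.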

CodePudding user response：

If the estimates in the plan are correct, it would almost certainly be faster to do it with a correlated subselect in the SELECT list, this way:

SELECT 
    eq.*, 
    (select max( dt ) from consum where consum.id_eq_equipment = eq.id_eq) as reg_last_date
FROM Equipment eq

Note this will return NULL for reg_last_date where there is no corresponding record in consum, so you might want to filter those out if you don't want to see them.
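One possible way to filter those rows out is to wrap the query in a derived table, since the reg_last_date alias cannot be referenced in the WHERE clause of the same query level. This is a sketch and assumes that a NULL reg_last_date only ever means "no rows in consum" (that is, dt itself is never NULL).

```sql
-- Sketch: keep only equipment that has at least one consum row.
-- The derived-table alias "sub" is arbitrary.
SELECT *
FROM  (
   SELECT eq.*,
          (SELECT max(dt)
           FROM   consum
           WHERE  consum.id_eq_equipment = eq.id_eq) AS reg_last_date
   FROM   Equipment eq
   ) sub
WHERE  reg_last_date IS NOT NULL;
```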

You would need an index on consum (id_eq_equipment, dt) to make it fast. Without it, the subselect has to scan consum once per row of Equipment.
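To check whether the index is actually used, the rewritten query can be run under EXPLAIN. A sketch follows; note that the ANALYZE option executes the query, so the timings are real but the statement takes as long as the query itself.

```sql
-- ANALYZE runs the query and reports actual row counts and timings;
-- BUFFERS adds the number of pages read or found in cache.
EXPLAIN (ANALYZE, BUFFERS)
SELECT eq.*,
       (SELECT max(dt)
        FROM   consum
        WHERE  consum.id_eq_equipment = eq.id_eq) AS reg_last_date
FROM   Equipment eq;
```

If the plan still shows a Seq Scan on consum inside the subplan rather than an index scan, the index is probably missing or does not have id_eq_equipment as its leading column.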