Avoid full table scan under window function run with use of join condition-CodePudding

Given a database table events with columns: event_id, correlation_id, username, create_timestamp. It contains more than 1M records.

There is a problem I am trying to solve: for each event of a particular user display its latest sibling event. Sibling events are the events which has same correlation_id. The query I use for that is the following:

SELECT 
  "events"."event_id" AS "event_id", 
  "latest"."event_id" AS "latest_event_id" 
FROM 
  events "events" 
  JOIN (
    SELECT 
      "latest"."correlation_id" AS "correlation_id", 
      "latest"."event_id" AS "event_id", 
      ROW_NUMBER () OVER (
        PARTITION BY "latest"."correlation_id" 
        ORDER BY 
          "latest"."create_timestamp" ASC
      ) AS "rn" 
    FROM 
      events "latest"
  ) "latest" ON (
    "latest"."correlation_id" = "events"."correlation_id" 
    AND "latest"."rn" = 1
  ) 
WHERE 
  "events"."username" = 'user1'

It gets correct list of results but causes performance problems which must be fixed. Here is an execution plan of the query:

Hash Right Join  (cost=13538.03..15522.72 rows=1612 width=64)
  Hash Cond: (("latest".correlation_id)::text = ("events".correlation_id)::text)
  ->  Subquery Scan on "latest"  (cost=12031.35..13981.87 rows=300 width=70)
        Filter: ("latest".rn = 1)
        ->  WindowAgg  (cost=12031.35..13231.67 rows=60016 width=86)
              ->  Sort  (cost=12031.35..12181.39 rows=60016 width=78)
                    Sort Key: "latest_1".correlation_id, "latest_1".create_timestamp
                    ->  Seq Scan on events "latest_1"  (cost=0.00..7268.16 rows=60016 width=78)
  ->  Hash  (cost=1486.53..1486.53 rows=1612 width=70)
        ->  Index Scan using events_username on events "events" (cost=0.41..1486.53 rows=1612 width=70)
              Index Cond: ((username)::text = 'user1'::text)

From the plan, I can conclude that the performance problem mainly caused by calculation of latest events for ALL events in the table which takes ~80% of cost. Also it performs the calculations even if there are no events for a user at all. Ideally, I would like the query to do these steps which seem more efficient for me:

find all events by a user
for each event from Step 1, find all siblings, sort them and get the 1st

To simplify the discussion, let's consider all required indexes as already created for needed columns. It doesn't seem to me that the problem can be solved purely by index creation.

Any ideas what can be done to improve the performance? Probably, there are options to rewrite the query or adjust a configuration of the table.

Note that this question is significantly obfuscated in terms of business meaning to clearly demonstrate the technical problem I face.

CodePudding user response：

The window function has to scan the whole table. It has no idea that really you are only interested in the first value. A lateral join could perform better and is more readable anyway:

ELECT 
  e.event_id, 
  latest.latest_event_id
FROM 
  events AS e
  CROSS JOIN LATERAL
     (SELECT
        l.event_id AS latest_event_id
      FROM
        events AS l
      WHERE
        l.correlation_id = e.correlation_id 
      ORDER BY l.create_timestamp
      FETCH FIRST 1 ROWS ONLY
     ) AS latest;

The perfect index to support that would be

CREATE INDEX ON event (correlation_id, create_timestamp);

CodePudding user response：

One option that may improve efficiency is to rewrite the query filtering "rn" = 1 beforehand to reduce resulting rows when joining tables:

WITH "latestCte"("correlation_id", "event_id") as (SELECT 
  "correlation_id", 
  "event_id", 
  ROW_NUMBER () OVER (
    PARTITION BY "correlation_id" 
    ORDER BY 
      "create_timestamp" ASC
  ) AS "rn" 
FROM 
  events)
SELECT 
  "events"."event_id" AS "event_id", 
  "latest"."event_id" AS "latest_event_id" 
FROM 
  events "events" 
  JOIN (
    SELECT "correlation_id", "event_id" FROM "latestCte" WHERE "rn" = 1
  ) "latest" ON (
    "latest"."correlation_id" = "events"."correlation_id" 
  ) 
WHERE 
  "events"."username" = 'user1'

Hope it helps, also I am curious to see the resulting execution plan of this query. Best regards.

CodePudding user response：

Without having access to the data, I'm really just throwing out ideas...

Instead of a subquery, it's worth trying a materialized CTE
Rather than the row_number analytic, you can try a distinct on. Honestly, I don't predict any gains. It's basically the same thing at the database level

Sample of both:

with latest as materialized (
    SELECT distinct on ("correlation_id")
      "correlation_id", "event_id" 
    FROM events
    order by
      "correlation_id", "create_timestamp" desc
)
SELECT 
  e."event_id", 
  l."event_id" AS "latest_event_id" 
FROM 
  events "events" e
  join latest l ON
    l."correlation_id" = e."correlation_id" 
WHERE 
  e."username" = 'user1'

Additional suggestion -- if you are doing this over and over, I'd consider creating a temp table or materialized view for "latest," including an index by coorelation_id rather than re-running the subquery (or CTE) every single time. This will be a one-time pain followed my repeated gain.

Yet one more suggestion -- if at all possible, drop the double quotes from your object names. Maybe it's just me, but I find them brutal. Unless you have spaces, reserved words or mandatory uppercase in your field names (please don't do that), then these create more problems than they solve. I kept them in the query I listed above, but it pained me.

And this last comment goes back to knowing your data... since row_number and distinct on are relatively expensive operations, it may make sense to make your subquery/cte more selective by introducing the "user1" constraint. This is completely untested, but something like this:

SELECT distinct on (e1.correlation_id)
  e1.correlation_id, e1.event_id
FROM events e1
join events e2 on
  e1.correlation_id = e2.correlation_id and
  e2.username = 'user1'
order by
  e1.correlation_id, e1.create_timestamp desc

CodePudding user response：

All those needless double quotes are making my eyes bleed.

This should very fast with a lateral join, provided the number of returned rows is rather low, i.e. 'user1' is rather specific.

explain analyze SELECT 
  events.event_id AS event_id, 
  latest.event_id AS latest_event_id 
FROM 
  events "events" 
  cross JOIN lateral (
    SELECT 
      latest.event_id AS event_id 
      FROM events latest
      WHERE latest.correlation_id=events.correlation_id 
      ORDER by create_timestamp ASC limit 1
  ) latest 
WHERE 
  events.username = 'user1';

You will want an index on username, and one on (correlation_id, create_timestamp)

If the number of rows returned is large, then your current query, which precomputes in bulk, is probably better. But would be faster if you used DISTINCT ON rather than the window function to pull out just the latest for each correlation_id. Unfortunately the planner does not understand these queries to be equivalent, and so will not interconvert between them based on what it thinks will be faster.