Optimising a Prisma-based PostgreSQL query with indexes or DB settings

I'm trying to optimize a pagination query that runs on a table of Videos joined with Channels for a project that uses the Prisma ORM.

I don't have a lot of experience adding database indexes to optimize performance and could use some guidance on whether I missed something obvious or I'm simply constrained by the database server hardware.

When extracted, the query that Prisma runs looks like this and takes at least 4-10 seconds to run on 136k videos even after I added some indexes:

SELECT
    "public"."Video"."id",
    "public"."Video"."youtubeId",
    "public"."Video"."channelId",
    "public"."Video"."type",
    "public"."Video"."status",
    "public"."Video"."reviewed",
    "public"."Video"."category",
    "public"."Video"."youtubeTags",
    "public"."Video"."language",
    "public"."Video"."title",
    "public"."Video"."description",
    "public"."Video"."duration",
    "public"."Video"."durationSeconds",
    "public"."Video"."viewCount",
    "public"."Video"."likeCount",
    "public"."Video"."commentCount",
    "public"."Video"."scheduledStartTime",
    "public"."Video"."actualStartTime",
    "public"."Video"."actualEndTime",
    "public"."Video"."sortTime",
    "public"."Video"."createdAt",
    "public"."Video"."updatedAt",
    "public"."Video"."publishedAt"
FROM
    "public"."Video",
    (
        SELECT
            "public"."Video"."sortTime" AS "Video_sortTime_0"
        FROM
            "public"."Video"
        WHERE ("public"."Video"."id") = (29949)
    ) AS "order_cmp"
WHERE (
    ("public"."Video"."id") IN(
        SELECT
            "t0"."id" FROM "public"."Video" AS "t0"
            INNER JOIN "public"."Channel" AS "j0" ON ("j0"."id") = ("t0"."channelId")
        WHERE (
            (NOT "j0"."status" IN('HIDDEN', 'ARCHIVED'))
            AND "t0"."id" IS NOT NULL)
    )
    AND "public"."Video"."status" IN('UPCOMING', 'LIVE', 'PUBLISHED')
    AND "public"."Video"."sortTime" <= "order_cmp"."Video_sortTime_0")
ORDER BY
    "public"."Video"."sortTime" DESC OFFSET 0;

I haven't yet defined the indexes in my Prisma schema file, I've just been setting them directly on the database. These are the current indexes on the Video and Channel tables:

CREATE UNIQUE INDEX "Video_youtubeId_key" ON public."Video" USING btree ("youtubeId")
CREATE INDEX "Video_status_idx" ON public."Video" USING btree (status)
CREATE INDEX "Video_sortTime_idx" ON public."Video" USING btree ("sortTime" DESC)
CREATE UNIQUE INDEX "Video_pkey" ON public."Video" USING btree (id)
CREATE INDEX "Video_channelId_idx" ON public."Video" USING btree ("channelId")

CREATE UNIQUE INDEX "Channel_youtubeId_key" ON public."Channel" USING btree ("youtubeId")
CREATE UNIQUE INDEX "Channel_pkey" ON public."Channel" USING btree (id)

EXPLAIN (ANALYZE,BUFFERS) for the query shows this (analyzer tool link):

Sort  (cost=114760.67..114867.67 rows=42801 width=1071) (actual time=4115.144..4170.368 rows=13943 loops=1)
  Sort Key: "Video"."sortTime" DESC
  Sort Method: external merge  Disk: 12552kB
  Buffers: shared hit=19049 read=54334 dirtied=168, temp read=1569 written=1573
  I/O Timings: read=11229.719
  ->  Nested Loop  (cost=39030.38..91423.62 rows=42801 width=1071) (actual time=2720.873..4037.549 rows=13943 loops=1)
        Join Filter: ("Video"."sortTime" <= "Video_1"."sortTime")
        Rows Removed by Join Filter: 115529
        Buffers: shared hit=19049 read=54334 dirtied=168
        I/O Timings: read=11229.719
        ->  Index Scan using "Video_pkey" on "Video" "Video_1"  (cost=0.42..8.44 rows=1 width=8) (actual time=0.852..1.642 rows=1 loops=1)
              Index Cond: (id = 29949)
              Buffers: shared hit=2 read=2
              I/O Timings: read=0.809
        ->  Gather  (cost=39029.96..89810.14 rows=128404 width=1071) (actual time=2719.274..4003.170 rows=129472 loops=1)
              Workers Planned: 2
              Workers Launched: 2
              Buffers: shared hit=19047 read=54332 dirtied=168
              I/O Timings: read=11228.910
              ->  Parallel Hash Semi Join  (cost=38029.96..75969.74 rows=53502 width=1071) (actual time=2695.849..3959.412 rows=43157 loops=3)
                    Hash Cond: ("Video".id = t0.id)
                    Buffers: shared hit=19047 read=54332 dirtied=168
                    I/O Timings: read=11228.910
                    ->  Parallel Seq Scan on "Video"  (cost=0.00..37202.99 rows=53938 width=1071) (actual time=0.929..1236.450 rows=43157 loops=3)
                          Filter: (status = ANY ('{UPCOMING,LIVE,PUBLISHED}'::"VideoStatus"[]))
                          Rows Removed by Filter: 3160
                          Buffers: shared hit=9289 read=27118
                          I/O Timings: read=3526.407
                    ->  Parallel Hash  (cost=37312.18..37312.18 rows=57422 width=4) (actual time=2692.172..2692.180 rows=46084 loops=3)
                          Buckets: 262144  Batches: 1  Memory Usage: 7520kB
                          Buffers: shared hit=9664 read=27214 dirtied=168
                          I/O Timings: read=7702.502
                          ->  Hash Join  (cost=173.45..37312.18 rows=57422 width=4) (actual time=3.485..2666.998 rows=46084 loops=3)
                                Hash Cond: (t0."channelId" = j0.id)
                                Buffers: shared hit=9664 read=27214 dirtied=168
                                I/O Timings: read=7702.502
                                ->  Parallel Seq Scan on "Video" t0  (cost=0.00..36985.90 rows=57890 width=8) (actual time=1.774..2646.207 rows=46318 loops=3)
                                      Filter: (id IS NOT NULL)
                                      Buffers: shared hit=9193 read=27214 dirtied=168
                                      I/O Timings: read=7702.502
                                ->  Hash  (cost=164.26..164.26 rows=735 width=4) (actual time=1.132..1.136 rows=735 loops=3)
                                      Buckets: 1024  Batches: 1  Memory Usage: 34kB
                                      Buffers: shared hit=471
                                      ->  Seq Scan on "Channel" j0  (cost=0.00..164.26 rows=735 width=4) (actual time=0.024..0.890 rows=735 loops=3)
                                            Filter: (status <> ALL ('{HIDDEN,ARCHIVED}'::"ChannelStatus"[]))
                                            Rows Removed by Filter: 6
                                            Buffers: shared hit=471
Planning Time: 8.134 ms
Execution Time: 4173.202 ms

Now, a hint from that same tool suggests that the sort had to spill to disk because my work_mem setting is too low: it needed 12560kB, and on my Lightsail Postgres DB with 1 GB of RAM it's set to 4MB.

I'm a bit nervous about bumping work_mem to something like 16MB or even 24MB on a whim. Is that too much for my server's total RAM? Does this look like my root problem? Is there anything else I can do with indexes or my query?

If it helps, the actual Prisma query looks like this:

  const videos = await ctx.prisma.video.findMany({
    where: {
      channel: {
        NOT: {
          status: {
            in: [ChannelStatus.HIDDEN, ChannelStatus.ARCHIVED],
          },
        },
      },
      status: {
        in: [VideoStatus.UPCOMING, VideoStatus.LIVE, VideoStatus.PUBLISHED],
      },
    },
    include: {
      channel: {
        include: {
          links: true,
        },
      },
    },
    cursor: _args.cursor
      ? {
          id: _args.cursor,
        }
      : undefined,
    skip: _args.cursor ? 1 : 0,
    orderBy: {
      sortTime: 'desc',
    },
    take: Math.min(_args.limit, config.GRAPHQL_MAX_RECENT_VIDEOS),
  });

Even if I eliminate the join with the Channel table from the Prisma query entirely, the performance doesn't improve by much and a query still takes 7-8 seconds to run.

CodePudding user response：

This query is a bit of an ORM-generated nested-select mess. Nested selects get in the way of the optimizer. Joins are usually better.

If written by hand, the query would be something like this.

-- Prisma created the tables and columns with mixed-case names,
-- so they have to be double-quoted.
select *
from "Video" as video
join "Channel" as channel on channel.id = video."channelId"
where video.status in ('UPCOMING', 'LIVE', 'PUBLISHED')
  -- Not clear what this is for, might be wacky pagination?
  and video."sortTime" <= (
      select "sortTime" from "Video" where id = 29949
  )
  and channel.status not in ('HIDDEN', 'ARCHIVED')
order by video."sortTime" desc
offset 0
limit 100

Pretty straightforward. Much easier to understand and optimize.

As explained below, this query would benefit from a single composite index on sortTime and status.

And since you're paginating, using limit to only get as many rows as you need in a page can drastically help with performance. Otherwise Postgres will do all the work to calculate all rows.
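If Prisma can't be persuaded to emit the LIMIT itself, one option is to build the hand-written query in application code. The following is only a sketch: `buildVideoPageQuery` and `PageQuery` are hypothetical names, the table and column names are taken from the question, and, like the original query, it returns the cursor row itself, so the caller still has to skip it.

```typescript
// Hypothetical helper: builds the hand-written page query as SQL text plus
// positional parameters. Nothing here talks to a database.
interface PageQuery {
  text: string;
  values: number[];
}

function buildVideoPageQuery(cursorId: number | null, limit: number): PageQuery {
  const values: number[] = [];
  let cursorFilter = "";

  // Only add the sortTime comparison when a cursor was supplied.
  if (cursorId !== null) {
    values.push(cursorId);
    cursorFilter = `
  AND v."sortTime" <= (SELECT "sortTime" FROM "Video" WHERE id = $${values.length})`;
  }

  // The page size always goes last, so LIMIT stops the scan early.
  values.push(limit);
  const text = `SELECT v.*
FROM "Video" AS v
JOIN "Channel" AS c ON c.id = v."channelId"
WHERE v.status IN ('UPCOMING', 'LIVE', 'PUBLISHED')
  AND c.status NOT IN ('HIDDEN', 'ARCHIVED')${cursorFilter}
ORDER BY v."sortTime" DESC
LIMIT $${values.length}`;

  return { text, values };
}

const page = buildVideoPageQuery(29949, 100);
console.log(page.text);
console.log(page.values); // [ 29949, 100 ]
```

The resulting text and values could then be handed to a raw-query call such as Prisma's `$queryRawUnsafe(text, ...values)`; how that fits into the rest of the resolver is up to you.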

The performance is getting killed by multiple sequential scans of video.

                    ->  Parallel Seq Scan on "Video"  (cost=0.00..37202.99 rows=53938 width=1071) (actual time=0.929..1236.450 rows=43157 loops=3)
                          Filter: (status = ANY ('{UPCOMING,LIVE,PUBLISHED}'::"VideoStatus"[]))
                          Rows Removed by Filter: 3160
                          Buffers: shared hit=9289 read=27118
                          I/O Timings: read=3526.407

                          ->  Parallel Seq Scan on "Video" t0  (cost=0.00..36985.90 rows=57890 width=8) (actual time=1.774..2646.207 rows=46318 loops=3)
                                      Filter: (id IS NOT NULL)
                                      Buffers: shared hit=9193 read=27214 dirtied=168
                                      I/O Timings: read=7702.502

Looking at the where clause...

WHERE (
    ("public"."Video"."id") IN(
        SELECT
            "t0"."id" FROM "public"."Video" AS "t0"
            INNER JOIN "public"."Channel" AS "j0" ON ("j0"."id") = ("t0"."channelId")
        WHERE (
            (NOT "j0"."status" IN('HIDDEN', 'ARCHIVED'))
            AND "t0"."id" IS NOT NULL)
    )
    AND "public"."Video"."status" IN('UPCOMING', 'LIVE', 'PUBLISHED')
    AND "public"."Video"."sortTime" <= "order_cmp"."Video_sortTime_0")

But you have indexes on video.id and video.status. Why is it doing a seq scan?

In general, Postgres uses one index per table scan. (It can combine several with a bitmap scan, but the planner often decides that is not worth it.) Your query needs to check three columns: id, status, and sortTime. None of your single-column indexes covers more than one of them, so the planner ends up seq scanning.
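There is probably a second factor at work, which the numbers in the plan make visible: the status filter barely removes anything. As a rough worked calculation (the figures are the per-worker averages copied from the Parallel Seq Scan node above; `selectivity` is just an illustrative helper):

```typescript
// Per-worker averages from the Parallel Seq Scan on "Video" in the plan.
const rowsKept = 43157;    // rows that passed the status filter
const rowsRemoved = 3160;  // "Rows Removed by Filter"

// Fraction of scanned rows that the filter lets through.
function selectivity(kept: number, removed: number): number {
  return kept / (kept + removed);
}

const statusSelectivity = selectivity(rowsKept, rowsRemoved);
console.log((statusSelectivity * 100).toFixed(1) + "%"); // 93.2%
```

When a condition matches roughly nine rows in ten, reading the whole table is usually cheaper than going through an index, so the status index on its own was never likely to be chosen here.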

To solve this, try creating a single composite index on both sortTime and status. This will allow Postgres to use an index for both the status and sortTime parts of the query.

create index "Video_sortTime_status_idx" on "Video" ("sortTime", status)

With this index in place, the separate sortTime index is no longer necessary, so drop it.
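One thing to watch when writing this DDL by hand: Prisma created the table and columns with mixed-case names ("Video", "sortTime"), and Postgres folds unquoted identifiers to lowercase, so the names must be double-quoted. A small sketch of generating the statement with quoting applied follows; `quoteIdent` and `createIndexSql` are hypothetical helpers, and the index name simply follows the pattern of the existing indexes.

```typescript
// Wrap an identifier in double quotes so Postgres preserves its case.
// Embedded double quotes are escaped by doubling them.
function quoteIdent(name: string): string {
  return '"' + name.replace(/"/g, '""') + '"';
}

// Build a CREATE INDEX statement for a btree index on the given columns.
function createIndexSql(table: string, columns: string[]): string {
  const indexName = `${table}_${columns.join("_")}_idx`;
  const cols = columns.map(quoteIdent).join(", ");
  return `CREATE INDEX ${quoteIdent(indexName)} ON ${quoteIdent(table)} (${cols});`;
}

console.log(createIndexSql("Video", ["sortTime", "status"]));
// CREATE INDEX "Video_sortTime_status_idx" ON "Video" ("sortTime", "status");
```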

The second seq scan is from "t0"."id" IS NOT NULL. "t0" is the Video table. "id" is its primary key. It should be impossible for a primary key to be null, so remove that.

CodePudding user response：

I don't think any index or db setting is going to improve your existing query by much.

Two small changes to the existing query do get it to use the "sortTime" index for ordering, but I don't know if you can influence Prisma to make them. One is to add an explicit LIMIT (I don't know whether that is necessary, because I don't know how to test it with the "take" method instead of LIMIT), and the other is to move the 29949 subquery out of the join and put it directly into the WHERE clause:

AND "public"."Video"."sortTime" <= (
    SELECT
        "public"."Video"."sortTime" AS "Video_sortTime_0"
    FROM
        "public"."Video"
    WHERE ("public"."Video"."id") = (29949)
)

But if Prisma allows you to inject custom queries, I would just rewrite it from scratch along the lines Schwern has suggested.

An improvement in the PostgreSQL planner might get it to work without moving the subquery. But even if we knew exactly what to change, had a high-quality implementation of it, and could convince people it was a trade-off-free improvement, it would still not be released for over a year (in v16). So it wouldn't help you immediately, and I wouldn't have much hope of getting it accepted anyway.
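As for whether raising work_mem is safe on a 1 GB server: work_mem is a limit per sort or hash operation, not per server, so several sessions (and several nodes within one query) can each claim that much at the same time. A back-of-the-envelope sketch, in which the connection count and the number of operations per query are purely hypothetical, and parallel workers and hash_mem_multiplier are ignored:

```typescript
// Rough upper bound on memory that sorts and hashes could claim at once.
// All inputs are assumptions; substitute your own numbers.
interface WorkMemScenario {
  workMemMB: number;    // candidate work_mem value
  connections: number;  // sessions running such a query at the same time
  opsPerQuery: number;  // sort/hash nodes per query, each allowed work_mem
}

function worstCaseWorkMemMB(s: WorkMemScenario): number {
  return s.workMemMB * s.connections * s.opsPerQuery;
}

const current = worstCaseWorkMemMB({ workMemMB: 4, connections: 10, opsPerQuery: 2 });
const proposed = worstCaseWorkMemMB({ workMemMB: 16, connections: 10, opsPerQuery: 2 });
console.log(current, proposed); // 80 320
```

Under those assumptions 16MB is a noticeable but not reckless share of 1 GB; with many more concurrent connections it would be. In any case, a larger work_mem would only remove the on-disk sort, not the sequential scans that account for most of the time in the plan.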