Home > OS >  Why is this query too slow? Getting list of A with one associated foreign-keyed B for each item
Why is this query too slow? Getting list of A with one associated foreign-keyed B for each item

Time:10-30

I have Website(Id) table, each record may have multiple CheckLog(FK WebsiteId) entries associated. CheckLog also has a compound index of [WebsiteId, CreatedTime]. Website has only around 20 records but overtime CheckLog would grow, 3 millions entries at the time I have this problem. (See schema using EF Core at the end of the question).

A frequent query I have is to query the list of all Websites, along with zero/one latest CheckLog record:

return await this.ctx.Websites.AsNoTracking()
    .Select(q => new  WebsiteListItem()
    {
        Website = q,
        LatestCheckLog = q.CheckLogs
            .OrderByDescending(q => q.CreatedTime)
            .FirstOrDefault(),
    })
    .ToListAsync();

I believe the [WebsiteId, CreatedTime] index should help. However, the query takes around 11s to execute. Here's the translated query, along with EXPLAIN QUERY PLAN:

      SELECT "w"."Id", "t0"."Id", "t0"."CreatedTime", "t0"."WebsiteId"
      FROM "Websites" AS "w"
      LEFT JOIN (
          SELECT "t"."Id", "t"."CreatedTime", "t"."WebsiteId"
          FROM (
              SELECT "c"."Id", "c"."CreatedTime", "c"."WebsiteId", ROW_NUMBER() OVER(PARTITION BY "c"."WebsiteId" ORDER BY "c"."CreatedTime" DESC) AS "row"
              FROM "CheckLogs" AS "c"
          ) AS "t"
          WHERE "t"."row" <= 1
      ) AS "t0" ON "w"."Id" = "t0"."WebsiteId"

enter image description here

MATERIALIZE 1
CO-ROUTINE 4
SCAN TABLE CheckLogs AS c USING INDEX IX_CheckLogs_WebsiteId_CreatedTime
USE TEMP B-TREE FOR RIGHT PART OF ORDER BY
SCAN SUBQUERY 4
SCAN TABLE Websites AS w
SEARCH SUBQUERY 1 AS t USING AUTOMATIC COVERING INDEX (WebsiteId=?)

Is this fixable with Index? If not, is there an efficient way to query it without creating N 1 queries? I tried to think of a way to do that with 2 queries but can't think of any better way to translate it the way EF Core does).

Also I believe this is a very common problem but I don't know what keyword I should use to find out solution for this kind of problem. I am okay with a general solution for this kind of problem (i.e. get the latest Product of a list of Categories). Thank you.


I use EF Core for DB Schema:

    public class Website
    {

        public int Id { get; set; }

        // Other properties

        public ICollection<CheckLog> CheckLogs { get; set; }

    }

    [Index(nameof(CreatedTime))]
    [Index(nameof(WebsiteId), nameof(CreatedTime))]
    public class CheckLog
    {
        public int Id { get; set; }
        public DateTime CreatedTime { get; set; }

        public int WebsiteId { get; set; }
        public Website Website { get; set; }

        // Other properties

    }

CodePudding user response:

If what you want is to get the row with the latest CreatedTime for each WebsiteId then there is no need for any join.

Just aggregate and set the condition:

HAVING MAX(CreatedTime)

This is not standard SQL, but utilizes SQLite's bare columns:

SELECT *
FROM CheckLogs 
GROUP BY WebsiteId
HAVING MAX(CreatedTime);

If you want to join it to Websites:

SELECT  w.Id, t.Id, t.CreatedTime, t.WebsiteId 
FROM Websites AS w 
LEFT JOIN (
  SELECT *
  FROM CheckLogs 
  GROUP BY WebsiteId
  HAVING MAX(CreatedTime)
) AS t ON w.Id = t.WebsiteId;

CodePudding user response:

Thanks to this answer, I found out how to rewrite the query:

SELECT L.*
FROM CheckLogs L
INNER JOIN
(SELECT WebsiteId, Max(CreatedTime) AS CreatedTime
FROM CheckLogs
GROUP BY WebsiteId) L2
ON L.WebsiteId = L2.WebsiteId AND L.CreatedTime = L2.CreatedTime

Since EF Core cannot translate JOIN query, I rewrote the code this way, 2 round trip to database server. Note that since WebsiteIds are numbers, it's safe to put it this way, if you put in string parameters, you need to sanitize them.

            var websites = await query.ToListAsync();
            var websiteIds = websites.Select(q => q.Id).ToList();

            var websiteIdsString = string.Join(",", websiteIds);
            var logQuery = this.ctx.CheckLogs.FromSqlRaw(@$"
SELECT L.*
FROM CheckLogs L
INNER JOIN
(SELECT WebsiteId, Max(CreatedTime) AS CreatedTime
FROM CheckLogs
GROUP BY WebsiteId) L2
ON L.WebsiteId = L2.WebsiteId AND L.CreatedTime = L2.CreatedTime
WHERE L.WebsiteId IN ({websiteIdsString})");

            var logs = await logQuery.AsNoTracking().ToListAsync();
            var logDict = logs.ToLookup(q => q.WebsiteId);

            return websites.Select(q => new WebsiteListItem()
            {
                Website = q,
                LatestCheckLog = logDict[q.Id].FirstOrDefault(),
            }).ToList();

CodePudding user response:

First try this query:

SELECT MAX("c"."CreatedTime"), "c"."WebsiteId" FROM "CheckLogs" AS "c" GROUP BY "c"."WebsiteId"

It will be hard to optimize your query to be faster than the above. If it is fast, then there is hope.

you could try:

with logs as (
    select max("c"."Id") "Id", max("c"."CreatedTime") "CreatedTime", "c"."WebsiteId"
    from "CheckLogs" AS "c"
    group by "c"."WebsiteId"
    having count(*) = 1)

select "w"."Id", "l"."Id", "l"."CreatedTime", "l"."WebsiteId"
from "Websites" AS "w"
join logs as "l" on "w"."Id" = "l"."WebsiteId"

union all

select "w"."Id", null, null, null
from "Websites" AS "w"
where not exists (
    select 'x'
    from "CheckLogs" AS "l"
    where "l"."WebsiteId" = "w"."Id")

You could also try the two halves of union all above to see how fast they are. If still slow you can add a new table with the result you are looking for and create trigger on "CheckLogs" to populate it, so the data is ready.

  • Related