Home > Enterprise >  Performant query count of rows within range over sequence
Performant query count of rows within range over sequence

Time:08-02

I have a SQLite table with an Id and an active period, and I am trying to get counts of the number of active of rows over a sequence of times.

A vastly simplified version of this table is:

CREATE TABLE Data (
    EntityId INTEGER NOT NULL,
    Start    INTEGER NOT NULL,
    Finish   INTEGER
);

With some example data

INSERT INTO Data VALUES 
    (1, 0, 2),
    (1, 4, 6),
    (1, 8, NULL),
    (2, 5, 7),
    (2, 9, NULL),
    (3, 8, NULL);

And an desired output of something like:

Time Count
0 1
1 1
2 0
3 0
4 1
5 2
6 1
7 0
8 2
9 3

For which I am querying with:

WITH RECURSIVE Generate_Time(Time) AS (
    SELECT 0
    UNION ALL
    SELECT Time   1 FROM Generate_Time
    WHERE Time   1 <= (SELECT MAX(Start) FROM Data)
)
SELECT Time, COUNT(EntityId)
FROM Data
JOIN Generate_Time ON Start <= Time AND (Finish > Time OR Finish IS NULL)
GROUP BY Time

There is also some data I need to categorise the counts by (some are on the original table, some are using a join), but I am hitting a performance bottleneck in the order of seconds on even small amounts of data (~25,000 rows) without any of that.

I have added an index on the table covering Start/End:

CREATE INDEX Ix_Data ON Data (
    Start,
    Finish
);

and that helped somewhat but I can't help but feel there's a more elegant & performant way of doing this. Using the CTE to iterate over a range doesn't seem like it will scale very well but I can't think of another way to calculate what I need.

I've been looking at the query plan too, and I think the slow part of the GROUP BY since it can't use an index for that since it's from the CTE so SQLite generates a temporary BTree:

3   0   0   MATERIALIZE 3
7   3   0   SETUP
8   7   0   SCAN CONSTANT ROW
21  3   0   RECURSIVE STEP
22  21  0   SCAN TABLE Generate_Time
27  21  0   SCALAR SUBQUERY 2
32  27  0   SEARCH TABLE Data USING COVERING INDEX Ix_Data
57  0   0   SCAN SUBQUERY 3
59  0   0   SEARCH TABLE Data USING INDEX Ix_Data (Start<?)
71  0   0   USE TEMP B-TREE FOR GROUP BY

Any suggestions of a way to speed this query up, or even a better way of storing this data to craft a tighter query would be most welcome!

CodePudding user response:

To get to the desired output as per your question, the following can be done. For better performance, on option is to make use of generate_series to generate rows instead of the recursive CTE and limit the number of rows to the max-value available in data.

WITH RECURSIVE Generate_Time(Time) AS (
    SELECT 0
    UNION ALL
    SELECT Time   1 FROM Generate_Time
    WHERE Time   1 <= (SELECT MAX(Start) FROM Data)
)
   SELECT gt.Time
          ,count(d.entityid)
     FROM Generate_Time gt
LEFT JOIN Data d
       ON gt.Time between d.start and IFNULL(d.finish,gt.Time)
 GROUP BY gt.Time       
  

CodePudding user response:

This ended up being simply a case of the result set being too large. In my real data, the result set before grouping was ~19,000,000 records. I was able to do some partitioning on my client side, splitting the queries into smaller discrete chunks which improved performance ~10x, which still wasn't quite as fast as I wanted but was acceptable for my use case.

  • Related