Performant query count of rows within range over sequence-CodePudding

I have a SQLite table with an Id and an active period, and I am trying to get counts of the number of active of rows over a sequence of times.

A vastly simplified version of this table is:

CREATE TABLE Data (
    EntityId INTEGER NOT NULL,
    Start    INTEGER NOT NULL,
    Finish   INTEGER
);

With some example data

INSERT INTO Data VALUES 
    (1, 0, 2),
    (1, 4, 6),
    (1, 8, NULL),
    (2, 5, 7),
    (2, 9, NULL),
    (3, 8, NULL);

And an desired output of something like:

Time	Count
0	1
1	1
2	0
3	0
4	1
5	2
6	1
7	0
8	2
9	3

For which I am querying with:

WITH RECURSIVE Generate_Time(Time) AS (
    SELECT 0
    UNION ALL
    SELECT Time   1 FROM Generate_Time
    WHERE Time   1 <= (SELECT MAX(Start) FROM Data)
)
SELECT Time, COUNT(EntityId)
FROM Data
JOIN Generate_Time ON Start <= Time AND (Finish > Time OR Finish IS NULL)
GROUP BY Time

There is also some data I need to categorise the counts by (some are on the original table, some are using a join), but I am hitting a performance bottleneck in the order of seconds on even small amounts of data (~25,000 rows) without any of that.

I have added an index on the table covering Start/End:

CREATE INDEX Ix_Data ON Data (
    Start,
    Finish
);

and that helped somewhat but I can't help but feel there's a more elegant & performant way of doing this. Using the CTE to iterate over a range doesn't seem like it will scale very well but I can't think of another way to calculate what I need.

I've been looking at the query plan too, and I think the slow part of the GROUP BY since it can't use an index for that since it's from the CTE so SQLite generates a temporary BTree:

3   0   0   MATERIALIZE 3
7   3   0   SETUP
8   7   0   SCAN CONSTANT ROW
21  3   0   RECURSIVE STEP
22  21  0   SCAN TABLE Generate_Time
27  21  0   SCALAR SUBQUERY 2
32  27  0   SEARCH TABLE Data USING COVERING INDEX Ix_Data
57  0   0   SCAN SUBQUERY 3
59  0   0   SEARCH TABLE Data USING INDEX Ix_Data (Start<?)
71  0   0   USE TEMP B-TREE FOR GROUP BY

Any suggestions of a way to speed this query up, or even a better way of storing this data to craft a tighter query would be most welcome!

CodePudding user response：

To get to the desired output as per your question, the following can be done. For better performance, on option is to make use of generate_series to generate rows instead of the recursive CTE and limit the number of rows to the max-value available in data.

WITH RECURSIVE Generate_Time(Time) AS (
    SELECT 0
    UNION ALL
    SELECT Time   1 FROM Generate_Time
    WHERE Time   1 <= (SELECT MAX(Start) FROM Data)
)
   SELECT gt.Time
          ,count(d.entityid)
     FROM Generate_Time gt
LEFT JOIN Data d
       ON gt.Time between d.start and IFNULL(d.finish,gt.Time)
 GROUP BY gt.Time

CodePudding user response：

This ended up being simply a case of the result set being too large. In my real data, the result set before grouping was ~19,000,000 records. I was able to do some partitioning on my client side, splitting the queries into smaller discrete chunks which improved performance ~10x, which still wasn't quite as fast as I wanted but was acceptable for my use case.