I have a SQLite table with an Id and an active period, and I am trying to get counts of the number of active of rows over a sequence of times.
A vastly simplified version of this table is:
CREATE TABLE Data (
EntityId INTEGER NOT NULL,
Start INTEGER NOT NULL,
Finish INTEGER
);
With some example data
INSERT INTO Data VALUES
(1, 0, 2),
(1, 4, 6),
(1, 8, NULL),
(2, 5, 7),
(2, 9, NULL),
(3, 8, NULL);
And an desired output of something like:
Time | Count |
---|---|
0 | 1 |
1 | 1 |
2 | 0 |
3 | 0 |
4 | 1 |
5 | 2 |
6 | 1 |
7 | 0 |
8 | 2 |
9 | 3 |
For which I am querying with:
WITH RECURSIVE Generate_Time(Time) AS (
SELECT 0
UNION ALL
SELECT Time 1 FROM Generate_Time
WHERE Time 1 <= (SELECT MAX(Start) FROM Data)
)
SELECT Time, COUNT(EntityId)
FROM Data
JOIN Generate_Time ON Start <= Time AND (Finish > Time OR Finish IS NULL)
GROUP BY Time
There is also some data I need to categorise the counts by (some are on the original table, some are using a join), but I am hitting a performance bottleneck in the order of seconds on even small amounts of data (~25,000 rows) without any of that.
I have added an index on the table covering Start/End:
CREATE INDEX Ix_Data ON Data (
Start,
Finish
);
and that helped somewhat but I can't help but feel there's a more elegant & performant way of doing this. Using the CTE to iterate over a range doesn't seem like it will scale very well but I can't think of another way to calculate what I need.
I've been looking at the query plan too, and I think the slow part of the GROUP BY
since it can't use an index for that since it's from the CTE so SQLite generates a temporary BTree:
3 0 0 MATERIALIZE 3
7 3 0 SETUP
8 7 0 SCAN CONSTANT ROW
21 3 0 RECURSIVE STEP
22 21 0 SCAN TABLE Generate_Time
27 21 0 SCALAR SUBQUERY 2
32 27 0 SEARCH TABLE Data USING COVERING INDEX Ix_Data
57 0 0 SCAN SUBQUERY 3
59 0 0 SEARCH TABLE Data USING INDEX Ix_Data (Start<?)
71 0 0 USE TEMP B-TREE FOR GROUP BY
Any suggestions of a way to speed this query up, or even a better way of storing this data to craft a tighter query would be most welcome!
CodePudding user response:
To get to the desired output as per your question, the following can be done. For better performance, on option is to make use of generate_series to generate rows instead of the recursive CTE and limit the number of rows to the max-value available in data.
WITH RECURSIVE Generate_Time(Time) AS (
SELECT 0
UNION ALL
SELECT Time 1 FROM Generate_Time
WHERE Time 1 <= (SELECT MAX(Start) FROM Data)
)
SELECT gt.Time
,count(d.entityid)
FROM Generate_Time gt
LEFT JOIN Data d
ON gt.Time between d.start and IFNULL(d.finish,gt.Time)
GROUP BY gt.Time
CodePudding user response:
This ended up being simply a case of the result set being too large. In my real data, the result set before grouping was ~19,000,000 records. I was able to do some partitioning on my client side, splitting the queries into smaller discrete chunks which improved performance ~10x, which still wasn't quite as fast as I wanted but was acceptable for my use case.