I've got the following append-only table in psql:
CREATE TABLE IF NOT EXISTS data (
id UUID DEFAULT gen_random_uuid () PRIMARY KEY,
test_id UUID NOT NULL,
user_id UUID NOT NULL,
completed BOOL NOT NULL DEFAULT False,
inserted_at TIMESTAMPTZ NOT NULL DEFAULT (NOW() AT TIME ZONE 'UTC'),
);
CREATE INDEX some_idx ON data (user_id, test_id, inserted_at DESC);
CREATE INDEX some_idx2 ON data (test_id, inserted_at DESC);
A single user_id
might have multiple entries for a given test_id
, but only one can be completed
(the completed
entry is also the last one).
I'm querying for a given test_id
. What I need is time-series-like data for each day in the past week. For each day, I should have the following:
- total - total entries for unique users WHERE
inserted_at < "day"
- completed - total completed entries for unique users where
inserted_at < "day"
Ultimately, total
and completed
are like counters and I'm simply trying to take their values for each day in the past week. For instance:
| date | total | completed |
|------------|-------|-----------|
| 2022.01.19 | 100 | 50 |
| 2022.01.18 | 90 | 45 |
| ... | | |
What would be a query with an efficient query-plan? I can consider adding new indexes or modifying the existing one.
PS: I've got a working version here:
SELECT date, entered, completed
FROM (
SELECT d::date AS date
FROM generate_series('2023-01-12', now(),INTERVAL '1 day') AS d
) AS dates
cross join lateral (
SELECT COUNT(DISTINCT user_id) AS entered,
COUNT(1) FILTER (WHERE completed) AS completed // no need for distinct as completed is guaranteed to be once per user
FROM data
WHERE
test_id = 'someId' AND
inserted_at < dates.date
) AS vals
I don't think this is a good/performant solution as it rescans the table with every lateral join iteration. Here's the query plan:
---------------------------------------------------------------------------------------------------------------------------->
| QUERY PLAN >
|---------------------------------------------------------------------------------------------------------------------------->
| Nested Loop (cost=185.18..185218.25 rows=1000 width=28) (actual time=0.928..7.687 rows=8 loops=1) >
| -> Function Scan on generate_series d (cost=0.01..10.01 rows=1000 width=8) (actual time=0.009..0.012 rows=8 loops=1) >
| -> Aggregate (cost=185.17..185.18 rows=1 width=16) (actual time=0.957..0.957 rows=1 loops=8) >
| -> Bitmap Heap Scan on data (cost=12.01..183.36 rows=363 width=38) (actual time=0.074..0.197 rows=779 loops>
| Recheck Cond: ((test_id = 'someId'::uuid) AND (inserted_at < (d.d)::date)) >
| Heap Blocks: exact=629 >
| -> Bitmap Index Scan on some_idx2 (cost=0.00..11.92 rows=363 width=0) (actual time=>
| Index Cond: ((test_id = 'someId'::uuid) AND (inserted_at < (d.d)::date>
| Planning Time: 0.261 ms >
| Execution Time: 7.733 ms >
---------------------------------------------------------------------------------------------------------------------------->
I'm sure I'm missing some convenient functions here that will help out. All help is appreciated :pray:
CodePudding user response:
After chat we found a solution
SELECT
date_trunc('day',inserted_at) AS adate,
COUNT(user_id) OVER (
ORDER BY inserted_at ASC
ROWS BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW
) -
SUM(CASE WHEN completed THEN 1 ELSE 0 END) OVER (
ORDER BY date_trunc('day', inserted_at) ASC
ROWS BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW
) as user_cnt,
SUM(CASE WHEN completed THEN 1 ELSE 0 END) OVER (
ORDER BY date_trunc('day', inserted_at) ASC
ROWS BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW
) AS completed
FROM executions
WHERE journey_id = 'd2be0e01-19b1-403e-8659-ce6222f074fd'
ORDER BY date_trunc('day', inserted_at) ASC
You can see we are using the same SUM window function twice. All SQL engines will optimize this to calculate once giving the expected result for the 2nd column. (completed items were counted as an addition user).
prior answers below
ok when I looked at it you don't need a window function after all -- just the trick of CASE statement in a SUM() with GROUP BY
SELECT COUNT(DISTINCT user_id) AS entered,
SUM(CASE WHEN completed THEN 1 ELSE 0 END) AS completed
FROM data
WHERE test_id = 'someId'
GROUP BY inserted_at
To get all prior for a given date looks something like this:
SELECT date_trunc(day,inserted_at) AS date,
DENSE_RANK()
OVER (PARTITION BY user_id ORDER BY inserted_at ASC
BETWEEN ROWS UNBOUNDED PRECEDING AND CURRENT ROW) as user_cnt,
SUM(CASE WHEN completed THEN 1 ELSE 0 END)
OVER (ORDER BY inserted_at ASC
BETWEEN ROWS UNBOUNDED PRECEDING AND CURRENT ROW) AS completed
FROM data
WHERE test_id = 'someId'
ORDER BY inserted_at ASC