Home > Software design >  Query for time-series-like counters in psql
Query for time-series-like counters in psql

Time:01-22

I've got the following append-only table in psql:

CREATE TABLE IF NOT EXISTS data (
    id UUID DEFAULT gen_random_uuid () PRIMARY KEY,

    test_id UUID NOT NULL,
    user_id UUID NOT NULL,
    completed BOOL NOT NULL DEFAULT False,

    inserted_at TIMESTAMPTZ NOT NULL DEFAULT (NOW() AT TIME ZONE 'UTC'),
);

CREATE INDEX some_idx ON data (user_id, test_id, inserted_at DESC);
CREATE INDEX some_idx2 ON data (test_id, inserted_at DESC);

A single user_id might have multiple entries for a given test_id, but only one can be completed (the completed entry is also the last one).

I'm querying for a given test_id. What I need is time-series-like data for each day in the past week. For each day, I should have the following:

  • total - total entries for unique users WHERE inserted_at < "day"
  • completed - total completed entries for unique users where inserted_at < "day"

Ultimately, total and completed are like counters and I'm simply trying to take their values for each day in the past week. For instance:

| date       | total | completed |
|------------|-------|-----------|
| 2022.01.19 | 100   | 50        |
| 2022.01.18 | 90    | 45        |
| ...        |       |           |

What would be a query with an efficient query-plan? I can consider adding new indexes or modifying the existing one.

PS: I've got a working version here:

SELECT date, entered, completed
FROM (
     SELECT d::date AS date
     FROM generate_series('2023-01-12', now(),INTERVAL '1 day') AS d
) AS dates
cross join lateral (
    SELECT COUNT(DISTINCT user_id) AS entered,
           COUNT(1) FILTER (WHERE completed) AS completed // no need for distinct as completed is guaranteed to be once per user
    FROM data
    WHERE
        test_id = 'someId' AND
        inserted_at < dates.date
) AS vals

I don't think this is a good/performant solution as it rescans the table with every lateral join iteration. Here's the query plan:

 ---------------------------------------------------------------------------------------------------------------------------->
| QUERY PLAN                                                                                                                 >
|---------------------------------------------------------------------------------------------------------------------------->
| Nested Loop  (cost=185.18..185218.25 rows=1000 width=28) (actual time=0.928..7.687 rows=8 loops=1)                         >
|   ->  Function Scan on generate_series d  (cost=0.01..10.01 rows=1000 width=8) (actual time=0.009..0.012 rows=8 loops=1)   >
|   ->  Aggregate  (cost=185.17..185.18 rows=1 width=16) (actual time=0.957..0.957 rows=1 loops=8)                           >
|         ->  Bitmap Heap Scan on data  (cost=12.01..183.36 rows=363 width=38) (actual time=0.074..0.197 rows=779 loops>
|               Recheck Cond: ((test_id = 'someId'::uuid) AND (inserted_at < (d.d)::date))  >
|               Heap Blocks: exact=629                                                                                       >
|               ->  Bitmap Index Scan on some_idx2  (cost=0.00..11.92 rows=363 width=0) (actual time=>
|                     Index Cond: ((test_id = 'someId'::uuid) AND (inserted_at < (d.d)::date>
| Planning Time: 0.261 ms                                                                                                    >
| Execution Time: 7.733 ms                                                                                                   >
 ---------------------------------------------------------------------------------------------------------------------------->

I'm sure I'm missing some convenient functions here that will help out. All help is appreciated :pray:

CodePudding user response:

After chat we found a solution

   SELECT
     date_trunc('day',inserted_at) AS adate,
     COUNT(user_id) OVER (
        ORDER BY inserted_at ASC 
        ROWS BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW
     ) -
     SUM(CASE WHEN completed THEN 1 ELSE 0 END) OVER (
        ORDER BY date_trunc('day', inserted_at) ASC 
        ROWS BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW
     ) as user_cnt,
     SUM(CASE WHEN completed THEN 1 ELSE 0 END) OVER (
        ORDER BY date_trunc('day', inserted_at) ASC 
        ROWS BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW
     ) AS completed
    FROM executions
    WHERE journey_id = 'd2be0e01-19b1-403e-8659-ce6222f074fd'
    ORDER BY date_trunc('day', inserted_at) ASC

You can see we are using the same SUM window function twice. All SQL engines will optimize this to calculate once giving the expected result for the 2nd column. (completed items were counted as an addition user).


prior answers below

ok when I looked at it you don't need a window function after all -- just the trick of CASE statement in a SUM() with GROUP BY

SELECT COUNT(DISTINCT user_id) AS entered,
       SUM(CASE WHEN completed THEN 1 ELSE 0 END) AS completed
FROM data
WHERE test_id = 'someId' 
GROUP BY inserted_at

To get all prior for a given date looks something like this:

SELECT date_trunc(day,inserted_at) AS date,
   DENSE_RANK()
     OVER (PARTITION BY user_id ORDER BY inserted_at ASC 
             BETWEEN ROWS UNBOUNDED PRECEDING AND CURRENT ROW) as user_cnt,
   SUM(CASE WHEN completed THEN 1 ELSE 0 END) 
     OVER (ORDER BY inserted_at ASC 
             BETWEEN ROWS UNBOUNDED PRECEDING AND CURRENT ROW) AS completed
FROM data
WHERE test_id = 'someId'
ORDER BY inserted_at ASC
  • Related