Home > OS >  Group by first difference on Clickhouse
Group by first difference on Clickhouse

Time:11-30

Given the following data:

┌─id────────────┬──────────created_at─┬─state─┐
│ 1234567890123 │ 2022-11-26 22:58:28 │     0 │
│ 1234567890123 │ 2022-11-26 22:57:00 │     0 │
│ 1234567890123 │ 2022-11-26 22:50:38 │     0 │
│ 1234567890123 │ 2022-11-26 22:41:46 │     0 │
│ 1234567890123 │ 2022-11-26 22:37:08 │     0 │
│ 1234567890123 │ 2022-11-26 22:28:09 │     0 │
│ 1234567890123 │ 2022-11-26 22:28:09 │     0 │
│ 1234567890123 │ 2022-11-26 22:25:13 │     0 │
│ 1234567890123 │ 2022-11-26 22:21:25 │     0 │
│ 1234567890123 │ 2022-11-26 22:15:43 │     0 │
│ 1234567890123 │ 2022-11-26 22:03:41 │     0 │
│ 1234567890123 │ 2022-11-26 21:28:39 │     1 │
│ 1234567890123 │ 2022-11-26 21:28:39 │     1 │
│ 1234567890123 │ 2022-11-26 21:08:03 │     1 │
│ 1234567890123 │ 2022-11-26 21:08:03 │     1 │
│ 1234567890123 │ 2022-11-26 20:03:45 │     1 │
│ 1234567890123 │ 2022-11-26 20:03:45 │     1 │
│ 1234567890123 │ 2022-11-26 20:02:34 │     0 │
│ 1234567890123 │ 2022-11-26 20:00:58 │     0 │
│ 1234567890123 │ 2022-11-26 19:58:26 │     0 │
│ 1234567890123 │ 2022-11-26 19:56:53 │     0 │
│ 1234567890123 │ 2022-11-26 19:55:29 │     0 │
│ 1234567890123 │ 2022-11-26 19:51:41 │     0 │
│ 1234567890123 │ 2022-11-26 19:51:41 │     0 │
│ 1234567890123 │ 2022-11-26 19:26:19 │     1 │
│ 1234567890123 │ 2022-11-26 19:26:19 │     1 │
│ 1234567890123 │ 2022-11-26 16:06:16 │     1 │
│ 1234567890123 │ 2022-11-26 16:06:16 │     1 │
│ 1234567890123 │ 2022-11-26 15:34:28 │     0 │
│ 1234567890123 │ 2022-11-26 15:27:46 │     0 │
└───────────────┴─────────────────────┴───────┘

I need to group the data in a way that the created_at of the first true state is grouped to the first false state. The end result should be:

┌─id────────────┬───────────────start─┬─────────────────end─┐
│ 1234567890123 │ 2022-11-26 16:06:16 │ 2022-11-26 19:51:41 │
│ 1234567890123 │ 2022-11-26 20:03:45 │ 2022-11-26 22:03:41 │
└───────────────┴─────────────────────┴─────────────────────┘

Given that, I need a way to have the data filtered in this way:

┌─id────────────┬──────────created_at─┬─state─┐
│ 1234567890123 │ 2022-11-26 22:03:41 │     0 │
│ 1234567890123 │ 2022-11-26 20:03:45 │     1 │
│ 1234567890123 │ 2022-11-26 19:51:41 │     0 │
│ 1234567890123 │ 2022-11-26 16:06:16 │     1 │
└───────────────┴─────────────────────┴───────┘

So I can then apply a LEAD/LAG window function and group the values.

But I cannot find a way to group the data in that way.

I've tried several combinations of LEAD/LAG, RANK, but I could not find a way that would match the first of every event instead of the last. (First time it goes TRUE, then the following first FALSE...)

This is the closest I could get, but the results are wrong:

WITH states AS (
  SELECT '1234567890123' AS id, toDateTime('2022-11-26 22:58:28') AS created_at, 0 AS state UNION ALL
  SELECT '1234567890123' AS id, toDateTime('2022-11-26 22:57:00') AS created_at, 0 AS state UNION ALL
  SELECT '1234567890123' AS id, toDateTime('2022-11-26 22:50:38') AS created_at, 0 AS state UNION ALL
  SELECT '1234567890123' AS id, toDateTime('2022-11-26 22:41:46') AS created_at, 0 AS state UNION ALL
  SELECT '1234567890123' AS id, toDateTime('2022-11-26 22:37:08') AS created_at, 0 AS state UNION ALL
  SELECT '1234567890123' AS id, toDateTime('2022-11-26 22:28:09') AS created_at, 0 AS state UNION ALL
  SELECT '1234567890123' AS id, toDateTime('2022-11-26 22:28:09') AS created_at, 0 AS state UNION ALL
  SELECT '1234567890123' AS id, toDateTime('2022-11-26 22:25:13') AS created_at, 0 AS state UNION ALL
  SELECT '1234567890123' AS id, toDateTime('2022-11-26 22:21:25') AS created_at, 0 AS state UNION ALL
  SELECT '1234567890123' AS id, toDateTime('2022-11-26 22:15:43') AS created_at, 0 AS state UNION ALL
  SELECT '1234567890123' AS id, toDateTime('2022-11-26 22:03:41') AS created_at, 0 AS state UNION ALL
  SELECT '1234567890123' AS id, toDateTime('2022-11-26 21:28:39') AS created_at, 1 AS state UNION ALL
  SELECT '1234567890123' AS id, toDateTime('2022-11-26 21:28:39') AS created_at, 1 AS state UNION ALL
  SELECT '1234567890123' AS id, toDateTime('2022-11-26 21:08:03') AS created_at, 1 AS state UNION ALL
  SELECT '1234567890123' AS id, toDateTime('2022-11-26 21:08:03') AS created_at, 1 AS state UNION ALL
  SELECT '1234567890123' AS id, toDateTime('2022-11-26 20:03:45') AS created_at, 1 AS state UNION ALL
  SELECT '1234567890123' AS id, toDateTime('2022-11-26 20:03:45') AS created_at, 1 AS state UNION ALL
  SELECT '1234567890123' AS id, toDateTime('2022-11-26 20:02:34') AS created_at, 0 AS state UNION ALL
  SELECT '1234567890123' AS id, toDateTime('2022-11-26 20:00:58') AS created_at, 0 AS state UNION ALL
  SELECT '1234567890123' AS id, toDateTime('2022-11-26 19:58:26') AS created_at, 0 AS state UNION ALL
  SELECT '1234567890123' AS id, toDateTime('2022-11-26 19:56:53') AS created_at, 0 AS state UNION ALL
  SELECT '1234567890123' AS id, toDateTime('2022-11-26 19:55:29') AS created_at, 0 AS state UNION ALL
  SELECT '1234567890123' AS id, toDateTime('2022-11-26 19:51:41') AS created_at, 0 AS state UNION ALL
  SELECT '1234567890123' AS id, toDateTime('2022-11-26 19:51:41') AS created_at, 0 AS state UNION ALL
  SELECT '1234567890123' AS id, toDateTime('2022-11-26 19:26:19') AS created_at, 1 AS state UNION ALL
  SELECT '1234567890123' AS id, toDateTime('2022-11-26 19:26:19') AS created_at, 1 AS state UNION ALL
  SELECT '1234567890123' AS id, toDateTime('2022-11-26 16:06:16') AS created_at, 1 AS state UNION ALL
  SELECT '1234567890123' AS id, toDateTime('2022-11-26 16:06:16') AS created_at, 1 AS state UNION ALL
  SELECT '1234567890123' AS id, toDateTime('2022-11-26 15:34:28') AS created_at, 0 AS state UNION ALL
  SELECT '1234567890123' AS id, toDateTime('2022-11-26 15:27:46') AS created_at, 0 AS state
)
SELECT
  id,
  created_at,
  state,
  next.1 AS next_created_at,
  next.2 AS next_state
FROM (
  SELECT
    id,
    created_at,
    state,
    any((created_at, state)) OVER (PARTITION BY id ORDER BY created_at ASC ROWS BETWEEN 1 FOLLOWING AND 1 FOLLOWING) AS next
  FROM states
  ORDER BY created_at DESC
)
WHERE
  state = 1
  AND next_state = 0

And the result of that query:

┌─id────────────┬──────────created_at─┬─state─┬─────next_created_at─┬─next_state─┐
│ 1234567890123 │ 2022-11-26 21:28:39 │     1 │ 2022-11-26 22:03:41 │          0 │
│ 1234567890123 │ 2022-11-26 19:26:19 │     1 │ 2022-11-26 19:51:41 │          0 │
└───────────────┴─────────────────────┴───────┴─────────────────────┴────────────┘

CodePudding user response:

@Fermuch , we could solve this problem using lead and case statment.

with states as (
----your data
),staging_data as (
select 
id,
created_at,state,
case when state = lead(state) over (partition by id order by null) then state else null end as pos 
from **states**
)
select id,created_at,state from staging_data where pos is null;

CodePudding user response:

Answering my own question for completion and future travelers sake.

First, let's start off by creating a test dataset:

CREATE TEMPORARY TABLE data (
  id String,
  created_at DateTime,
  state Boolean
);


INSERT INTO data(id, created_at, state) VALUES
  ('1234567890123', toDateTime('2022-11-26 22:58:28'), 0),
  ('1234567890123', toDateTime('2022-11-26 22:57:00'), 0),
  ('1234567890123', toDateTime('2022-11-26 22:50:38'), 0),
  ('1234567890123', toDateTime('2022-11-26 22:41:46'), 0),
  ('1234567890123', toDateTime('2022-11-26 22:37:08'), 0),
  ('1234567890123', toDateTime('2022-11-26 22:28:09'), 0),
  ('1234567890123', toDateTime('2022-11-26 22:28:09'), 0),
  ('1234567890123', toDateTime('2022-11-26 22:25:13'), 0),
  ('1234567890123', toDateTime('2022-11-26 22:21:25'), 0),
  ('1234567890123', toDateTime('2022-11-26 22:15:43'), 0),
  ('1234567890123', toDateTime('2022-11-26 22:03:41'), 0),
  ('1234567890123', toDateTime('2022-11-26 21:28:39'), 1),
  ('1234567890123', toDateTime('2022-11-26 21:28:39'), 1),
  ('1234567890123', toDateTime('2022-11-26 21:08:03'), 1),
  ('1234567890123', toDateTime('2022-11-26 21:08:03'), 1),
  ('1234567890123', toDateTime('2022-11-26 20:03:45'), 1),
  ('1234567890123', toDateTime('2022-11-26 20:03:45'), 1),
  ('1234567890123', toDateTime('2022-11-26 20:02:34'), 0),
  ('1234567890123', toDateTime('2022-11-26 20:00:58'), 0),
  ('1234567890123', toDateTime('2022-11-26 19:58:26'), 0),
  ('1234567890123', toDateTime('2022-11-26 19:56:53'), 0),
  ('1234567890123', toDateTime('2022-11-26 19:55:29'), 0),
  ('1234567890123', toDateTime('2022-11-26 19:51:41'), 0),
  ('1234567890123', toDateTime('2022-11-26 19:51:41'), 0),
  ('1234567890123', toDateTime('2022-11-26 19:26:19'), 1),
  ('1234567890123', toDateTime('2022-11-26 19:26:19'), 1),
  ('1234567890123', toDateTime('2022-11-26 16:06:16'), 1),
  ('1234567890123', toDateTime('2022-11-26 16:06:16'), 1),
  ('1234567890123', toDateTime('2022-11-26 15:34:28'), 0),
  ('1234567890123', toDateTime('2022-11-26 15:27:46'), 0);

Now, the real query:

SELECT
  id,
  created_at AS start,
  COALESCE(next_created_at, NOW()) AS end
FROM (
  SELECT
    *,
    max(
      if(toYear(created_at) = 1969, NULL, created_at)
    ) OVER (PARTITION BY id ORDER BY created_at ASC ROWS BETWEEN 1 FOLLOWING AND 1 FOLLOWING) AS next_created_at
  FROM (
    SELECT
      *
    FROM (
      SELECT
        id,
        created_at,
        state,
        runningDifference(state) AS diff
      FROM (SELECT * FROM data)
      ORDER BY created_at ASC
    )
    WHERE (
      (state = 1 AND diff = 1) -- this means it is a start
    OR
      (state = 0 AND diff = -1) -- this means it is an end
    )
    ORDER BY created_at ASC
  )
  ORDER BY created_at DESC
)
WHERE state = 1;

It returns:

┌─id────────────┬───────────────start─┬─────────────────end─┐
│ 1234567890123 │ 2022-11-26 20:03:45 │ 2022-11-26 22:03:41 │
│ 1234567890123 │ 2022-11-26 16:06:16 │ 2022-11-26 19:51:41 │
└───────────────┴─────────────────────┴─────────────────────┘

What's the magic? All magic resides on runningDifference(state). It is documented here. Long store short, it compares current value with next value and gives the difference (delta).

I've used that difference to check when a change occurs (from state=0 to state=1 and vice-versa). Since the change to 0 to 1 is 1, and the change from 1 to 0 is -1, you can even identify which change is and activation (0 to 1) and an end (1 to 0).

Since I'm only looking for activations, I filter by state = 1.

The query can be compressed and made more readable, but I had weird bugs with data ordering when using runningDifference (since it does not work over a window, it literally scans the next row it has). That's why there are several embedded queries.

  • Related