Given the following data:
┌─id────────────┬──────────created_at─┬─state─┐
│ 1234567890123 │ 2022-11-26 22:58:28 │ 0 │
│ 1234567890123 │ 2022-11-26 22:57:00 │ 0 │
│ 1234567890123 │ 2022-11-26 22:50:38 │ 0 │
│ 1234567890123 │ 2022-11-26 22:41:46 │ 0 │
│ 1234567890123 │ 2022-11-26 22:37:08 │ 0 │
│ 1234567890123 │ 2022-11-26 22:28:09 │ 0 │
│ 1234567890123 │ 2022-11-26 22:28:09 │ 0 │
│ 1234567890123 │ 2022-11-26 22:25:13 │ 0 │
│ 1234567890123 │ 2022-11-26 22:21:25 │ 0 │
│ 1234567890123 │ 2022-11-26 22:15:43 │ 0 │
│ 1234567890123 │ 2022-11-26 22:03:41 │ 0 │
│ 1234567890123 │ 2022-11-26 21:28:39 │ 1 │
│ 1234567890123 │ 2022-11-26 21:28:39 │ 1 │
│ 1234567890123 │ 2022-11-26 21:08:03 │ 1 │
│ 1234567890123 │ 2022-11-26 21:08:03 │ 1 │
│ 1234567890123 │ 2022-11-26 20:03:45 │ 1 │
│ 1234567890123 │ 2022-11-26 20:03:45 │ 1 │
│ 1234567890123 │ 2022-11-26 20:02:34 │ 0 │
│ 1234567890123 │ 2022-11-26 20:00:58 │ 0 │
│ 1234567890123 │ 2022-11-26 19:58:26 │ 0 │
│ 1234567890123 │ 2022-11-26 19:56:53 │ 0 │
│ 1234567890123 │ 2022-11-26 19:55:29 │ 0 │
│ 1234567890123 │ 2022-11-26 19:51:41 │ 0 │
│ 1234567890123 │ 2022-11-26 19:51:41 │ 0 │
│ 1234567890123 │ 2022-11-26 19:26:19 │ 1 │
│ 1234567890123 │ 2022-11-26 19:26:19 │ 1 │
│ 1234567890123 │ 2022-11-26 16:06:16 │ 1 │
│ 1234567890123 │ 2022-11-26 16:06:16 │ 1 │
│ 1234567890123 │ 2022-11-26 15:34:28 │ 0 │
│ 1234567890123 │ 2022-11-26 15:27:46 │ 0 │
└───────────────┴─────────────────────┴───────┘
I need to group the data in a way that the created_at of the first true
state is grouped to the first false
state. The end result should be:
┌─id────────────┬───────────────start─┬─────────────────end─┐
│ 1234567890123 │ 2022-11-26 16:06:16 │ 2022-11-26 19:51:41 │
│ 1234567890123 │ 2022-11-26 20:03:45 │ 2022-11-26 22:03:41 │
└───────────────┴─────────────────────┴─────────────────────┘
Given that, I need a way to have the data filtered in this way:
┌─id────────────┬──────────created_at─┬─state─┐
│ 1234567890123 │ 2022-11-26 22:03:41 │ 0 │
│ 1234567890123 │ 2022-11-26 20:03:45 │ 1 │
│ 1234567890123 │ 2022-11-26 19:51:41 │ 0 │
│ 1234567890123 │ 2022-11-26 16:06:16 │ 1 │
└───────────────┴─────────────────────┴───────┘
So I can then apply a LEAD/LAG window function and group the values.
But I cannot find a way to group the data in that way.
I've tried several combinations of LEAD/LAG, RANK, but I could not find a way that would match the first of every event instead of the last. (First time it goes TRUE, then the following first FALSE...)
This is the closest I could get, but the results are wrong:
WITH states AS (
SELECT '1234567890123' AS id, toDateTime('2022-11-26 22:58:28') AS created_at, 0 AS state UNION ALL
SELECT '1234567890123' AS id, toDateTime('2022-11-26 22:57:00') AS created_at, 0 AS state UNION ALL
SELECT '1234567890123' AS id, toDateTime('2022-11-26 22:50:38') AS created_at, 0 AS state UNION ALL
SELECT '1234567890123' AS id, toDateTime('2022-11-26 22:41:46') AS created_at, 0 AS state UNION ALL
SELECT '1234567890123' AS id, toDateTime('2022-11-26 22:37:08') AS created_at, 0 AS state UNION ALL
SELECT '1234567890123' AS id, toDateTime('2022-11-26 22:28:09') AS created_at, 0 AS state UNION ALL
SELECT '1234567890123' AS id, toDateTime('2022-11-26 22:28:09') AS created_at, 0 AS state UNION ALL
SELECT '1234567890123' AS id, toDateTime('2022-11-26 22:25:13') AS created_at, 0 AS state UNION ALL
SELECT '1234567890123' AS id, toDateTime('2022-11-26 22:21:25') AS created_at, 0 AS state UNION ALL
SELECT '1234567890123' AS id, toDateTime('2022-11-26 22:15:43') AS created_at, 0 AS state UNION ALL
SELECT '1234567890123' AS id, toDateTime('2022-11-26 22:03:41') AS created_at, 0 AS state UNION ALL
SELECT '1234567890123' AS id, toDateTime('2022-11-26 21:28:39') AS created_at, 1 AS state UNION ALL
SELECT '1234567890123' AS id, toDateTime('2022-11-26 21:28:39') AS created_at, 1 AS state UNION ALL
SELECT '1234567890123' AS id, toDateTime('2022-11-26 21:08:03') AS created_at, 1 AS state UNION ALL
SELECT '1234567890123' AS id, toDateTime('2022-11-26 21:08:03') AS created_at, 1 AS state UNION ALL
SELECT '1234567890123' AS id, toDateTime('2022-11-26 20:03:45') AS created_at, 1 AS state UNION ALL
SELECT '1234567890123' AS id, toDateTime('2022-11-26 20:03:45') AS created_at, 1 AS state UNION ALL
SELECT '1234567890123' AS id, toDateTime('2022-11-26 20:02:34') AS created_at, 0 AS state UNION ALL
SELECT '1234567890123' AS id, toDateTime('2022-11-26 20:00:58') AS created_at, 0 AS state UNION ALL
SELECT '1234567890123' AS id, toDateTime('2022-11-26 19:58:26') AS created_at, 0 AS state UNION ALL
SELECT '1234567890123' AS id, toDateTime('2022-11-26 19:56:53') AS created_at, 0 AS state UNION ALL
SELECT '1234567890123' AS id, toDateTime('2022-11-26 19:55:29') AS created_at, 0 AS state UNION ALL
SELECT '1234567890123' AS id, toDateTime('2022-11-26 19:51:41') AS created_at, 0 AS state UNION ALL
SELECT '1234567890123' AS id, toDateTime('2022-11-26 19:51:41') AS created_at, 0 AS state UNION ALL
SELECT '1234567890123' AS id, toDateTime('2022-11-26 19:26:19') AS created_at, 1 AS state UNION ALL
SELECT '1234567890123' AS id, toDateTime('2022-11-26 19:26:19') AS created_at, 1 AS state UNION ALL
SELECT '1234567890123' AS id, toDateTime('2022-11-26 16:06:16') AS created_at, 1 AS state UNION ALL
SELECT '1234567890123' AS id, toDateTime('2022-11-26 16:06:16') AS created_at, 1 AS state UNION ALL
SELECT '1234567890123' AS id, toDateTime('2022-11-26 15:34:28') AS created_at, 0 AS state UNION ALL
SELECT '1234567890123' AS id, toDateTime('2022-11-26 15:27:46') AS created_at, 0 AS state
)
SELECT
id,
created_at,
state,
next.1 AS next_created_at,
next.2 AS next_state
FROM (
SELECT
id,
created_at,
state,
any((created_at, state)) OVER (PARTITION BY id ORDER BY created_at ASC ROWS BETWEEN 1 FOLLOWING AND 1 FOLLOWING) AS next
FROM states
ORDER BY created_at DESC
)
WHERE
state = 1
AND next_state = 0
And the result of that query:
┌─id────────────┬──────────created_at─┬─state─┬─────next_created_at─┬─next_state─┐
│ 1234567890123 │ 2022-11-26 21:28:39 │ 1 │ 2022-11-26 22:03:41 │ 0 │
│ 1234567890123 │ 2022-11-26 19:26:19 │ 1 │ 2022-11-26 19:51:41 │ 0 │
└───────────────┴─────────────────────┴───────┴─────────────────────┴────────────┘
CodePudding user response:
@Fermuch , we could solve this problem using lead and case statment.
with states as (
----your data
),staging_data as (
select
id,
created_at,state,
case when state = lead(state) over (partition by id order by null) then state else null end as pos
from **states**
)
select id,created_at,state from staging_data where pos is null;
CodePudding user response:
Answering my own question for completion and future travelers sake.
First, let's start off by creating a test dataset:
CREATE TEMPORARY TABLE data (
id String,
created_at DateTime,
state Boolean
);
INSERT INTO data(id, created_at, state) VALUES
('1234567890123', toDateTime('2022-11-26 22:58:28'), 0),
('1234567890123', toDateTime('2022-11-26 22:57:00'), 0),
('1234567890123', toDateTime('2022-11-26 22:50:38'), 0),
('1234567890123', toDateTime('2022-11-26 22:41:46'), 0),
('1234567890123', toDateTime('2022-11-26 22:37:08'), 0),
('1234567890123', toDateTime('2022-11-26 22:28:09'), 0),
('1234567890123', toDateTime('2022-11-26 22:28:09'), 0),
('1234567890123', toDateTime('2022-11-26 22:25:13'), 0),
('1234567890123', toDateTime('2022-11-26 22:21:25'), 0),
('1234567890123', toDateTime('2022-11-26 22:15:43'), 0),
('1234567890123', toDateTime('2022-11-26 22:03:41'), 0),
('1234567890123', toDateTime('2022-11-26 21:28:39'), 1),
('1234567890123', toDateTime('2022-11-26 21:28:39'), 1),
('1234567890123', toDateTime('2022-11-26 21:08:03'), 1),
('1234567890123', toDateTime('2022-11-26 21:08:03'), 1),
('1234567890123', toDateTime('2022-11-26 20:03:45'), 1),
('1234567890123', toDateTime('2022-11-26 20:03:45'), 1),
('1234567890123', toDateTime('2022-11-26 20:02:34'), 0),
('1234567890123', toDateTime('2022-11-26 20:00:58'), 0),
('1234567890123', toDateTime('2022-11-26 19:58:26'), 0),
('1234567890123', toDateTime('2022-11-26 19:56:53'), 0),
('1234567890123', toDateTime('2022-11-26 19:55:29'), 0),
('1234567890123', toDateTime('2022-11-26 19:51:41'), 0),
('1234567890123', toDateTime('2022-11-26 19:51:41'), 0),
('1234567890123', toDateTime('2022-11-26 19:26:19'), 1),
('1234567890123', toDateTime('2022-11-26 19:26:19'), 1),
('1234567890123', toDateTime('2022-11-26 16:06:16'), 1),
('1234567890123', toDateTime('2022-11-26 16:06:16'), 1),
('1234567890123', toDateTime('2022-11-26 15:34:28'), 0),
('1234567890123', toDateTime('2022-11-26 15:27:46'), 0);
Now, the real query:
SELECT
id,
created_at AS start,
COALESCE(next_created_at, NOW()) AS end
FROM (
SELECT
*,
max(
if(toYear(created_at) = 1969, NULL, created_at)
) OVER (PARTITION BY id ORDER BY created_at ASC ROWS BETWEEN 1 FOLLOWING AND 1 FOLLOWING) AS next_created_at
FROM (
SELECT
*
FROM (
SELECT
id,
created_at,
state,
runningDifference(state) AS diff
FROM (SELECT * FROM data)
ORDER BY created_at ASC
)
WHERE (
(state = 1 AND diff = 1) -- this means it is a start
OR
(state = 0 AND diff = -1) -- this means it is an end
)
ORDER BY created_at ASC
)
ORDER BY created_at DESC
)
WHERE state = 1;
It returns:
┌─id────────────┬───────────────start─┬─────────────────end─┐
│ 1234567890123 │ 2022-11-26 20:03:45 │ 2022-11-26 22:03:41 │
│ 1234567890123 │ 2022-11-26 16:06:16 │ 2022-11-26 19:51:41 │
└───────────────┴─────────────────────┴─────────────────────┘
What's the magic? All magic resides on runningDifference(state)
. It is documented here. Long store short, it compares current value with next value and gives the difference (delta).
I've used that difference to check when a change occurs (from state=0 to state=1 and vice-versa). Since the change to 0 to 1 is 1, and the change from 1 to 0 is -1, you can even identify which change is and activation (0 to 1) and an end (1 to 0).
Since I'm only looking for activations, I filter by state = 1
.
The query can be compressed and made more readable, but I had weird bugs with data ordering when using runningDifference
(since it does not work over a window, it literally scans the next row it has). That's why there are several embedded queries.