I have a table in ClickHouse with events of connecting and disconnecting to system.
Query select timestamp, username, event from table
gives following result.
timestamp | username | event |
---|---|---|
December 20, 2022, 18:24 | 1 | Connect |
December 20, 2022, 18:30 | 1 | Disconnect |
December 20, 2022, 18:34 | 1 | Connect |
December 21, 2022, 12:07 | 1 | Disconnect |
December 20, 2022, 12:15 | 2 | Connect |
December 20, 2022, 12:47 | 2 | Disconnect |
The session must be show in table as finished by the end of the day. If user was connected to system on 20th December and had no "Disconnect" after that in the same day, I have to add such 'Disconnect' event at the table with some query. And also I have to add row with event of 'Connect' to the table at 00:00 of next day. For example, in sample table you can see that user #1 had not finished session on 20th December, so I want to have the following result:
timestamp | username | event |
---|---|---|
December 20, 2022, 18:24 | 1 | Connect |
December 20, 2022, 18:30 | 1 | Disconnect |
December 20, 2022, 18:34 | 1 | Connect |
December 20, 2022, 23:59 | 1 | Disconnect |
December 21, 2022, 00:00 | 1 | Connect |
December 21, 2022, 12:07 | 1 | Disconnect |
December 20, 2022, 12:15 | 2 | Connect |
December 20, 2022, 12:47 | 2 | Disconnect |
Is there any way to amend the query so it will work as I described above? ClickHouse is not so common as Posrgres or SQL Server, as far as I know, so code in Postgres dialect will be fine, I will find out how to make the same with ClickHouse.
CodePudding user response:
First identify these Disconnect events that need to be preceded by a Disconnect/Connect pair. This is t
CTE, overnight
attribute. Then insert a Disconnect/Connect pair into the_table
for every record of t
with overnight
true.
with t as
(
select *,
"timestamp"::date > lag("timestamp") over (partition by username order by "timestamp")::date overnight
from the_table
where "event" = 'Disconnect'
)
insert into the_table ("timestamp", "username", "event")
select ts, un, ev
from t
cross join lateral
(
values (date_trunc('day', "timestamp") - interval '1 second', "username", 'Disconnect'),
(date_trunc('day', "timestamp"), "username", 'Connect')
) as l(ts, un, ev)
where overnight;
DB-fiddle demo
CodePudding user response:
You do not need lateral
join to achieve desired result in Clickhouse (JOINs
in Clickhouse are always compute heavy operations since it is a columnar store. ClickHouse takes the right table and creates a hash table for it in RAM).
You can use UNION ALL
and ARRAY JOIN
in specific way to generate missing rows:
CREATE TABLE connections
(
`timestamp` DateTime,
`username` LowCardinality(String),
`event` enum('Connect', 'Disconnect')
)
ENGINE = Memory;
INSERT INTO connections VALUES
('2022-12-20 18:24:00'::DateTime, '1', 'Connect')
('2022-12-20 18:30:00'::DateTime, '1', 'Disconnect')
('2022-12-20 18:34:00'::DateTime, '1', 'Connect')
('2022-12-21 12:07:00'::DateTime, '1', 'Disconnect')
('2022-12-20 12:15:00'::DateTime, '2', 'Connect')
('2022-12-20 12:47:00'::DateTime, '2', 'Disconnect');
SELECT * FROM
(
SELECT
timestamp, username, event
FROM
connections
UNION ALL
SELECT
timestamp, username, event
FROM
(
SELECT
[toStartOfDay(timestamp) INTERVAL '1 DAY' - INTERVAL '1 SECOND',
toStartOfDay(timestamp) INTERVAL '1 DAY' ] timestamps,
username,
[ 'Disconnect', 'Connect' ] :: Array(Enum('Connect', 'Disconnect')) events
FROM
connections
GROUP BY
toStartOfDay(timestamp), username
HAVING
anyLast(event) = 'Connect'
) ARRAY JOIN
timestamps AS timestamp,
events AS event
)
ORDER BY
username, timestamp
Here is the result:
┌───────────timestamp─┬─username─┬─event──────┐
│ 2022-12-20 18:24:00 │ 1 │ Connect │
│ 2022-12-20 18:30:00 │ 1 │ Disconnect │
│ 2022-12-20 18:34:00 │ 1 │ Connect │
│ 2022-12-20 23:59:59 │ 1 │ Disconnect │
│ 2022-12-21 00:00:00 │ 1 │ Connect │
│ 2022-12-21 12:07:00 │ 1 │ Disconnect │
│ 2022-12-20 12:15:00 │ 2 │ Connect │
│ 2022-12-20 12:47:00 │ 2 │ Disconnect │
└─────────────────────┴──────────┴────────────┘
8 rows in set. Elapsed: 0.011 sec.