I have the next table that stores events: (simplified structure)
ID | User | Action | Timestamp |
---|---|---|---|
12 | user1 | END | 2022-01-01 05:00 |
43 | user1 | START | 2022-01-01 04:00 |
54 | user1 | END | 2022-01-01 03:00 |
13 | user1 | START | 2022-01-01 02:00 |
I need to join 2 events in one row, so any START event is accompanied by the END event that comes after that.
So the result should be the next:
ID1 | ID2 | User | Start Timestamp | End Timestamp |
---|---|---|---|---|
13 | 54 | user1 | 2022-01-01 02:00 | 2022-01-01 03:00 |
43 | 12 | user1 | 2022-01-01 04:00 | 2022-01-01 05:00 |
Ideally, it should not have to many performance issues, as there could be a lot of records in the table.
I've tried the next query:
select
s.id as "ID1",
e.id as "ID2",
s.user,
s.time as "Start Time",
e.time as "End Time"
from Events s
left join Events e on s.user = e.user
where s.action = 'START'
and e.action = 'END'
and s.timestamp < e.timestamp
but it will also match the record 13 to record 12. Is it possible to join the left side to right only once? (keeping in mind that is should be the next END record time-wise?
Thanks
CodePudding user response:
We want to get the nearest timestamp of the END
event for each START
event.
I would go with the following approach:
- Get the minimum greater than zero timestamp difference for each of the
START
events. - Now find the actual
END
event using thetimedelta
.
Assumptions
- At max we can have only one event which is not ended yet!
- For every
START
event, the timestamps will be unique. (Same goes forEND
event.
WITH closest_to_start AS (
SELECT
s.id,
MIN(TIMESTAMPDIFF(SECOND, s.timestamp, e.timestamp)) AS min_delta
FROM Events AS s
INNER JOIN Events AS e ON s.user = e.user
WHERE s.action = 'START'
AND e.action = 'END'
GROUP BY s.id
HAVING min_delta >= 0
)
SELECT s.id,
e.id
FROM Events AS s
OUTER JOIN closest_to_start ON closest_to_start.id = s.id
OUTER JOIN Events AS e ON e.id = s.id
WHERE s.action = 'START'
AND e.action = 'END'
AND
(
e.timestamp IS NULL
OR
TIMESTAMPDIFF(SECOND, s.timestamp, e.timestamp) = closest_to_start.min_delta
)
CodePudding user response:
Here is a PostgreSQL solution using lateral join. It might be working on HANA as no Postgres-specific features are used. The internal query selects the 'END' action for the same user that occurred soonest after the corresponding 'START'. Events that have started but not finished yet will have NULL values for "ID2" and "End timestamp".
create temporary table the_table(id integer, usr text, action text, ts timestamp);
insert into the_table values
(12,'user1','END','2022-01-01 05:00'),(43,'user1','START','2022-01-01 04:00'),
(54,'user1','END','2022-01-01 03:00'),(13,'user1','START','2022-01-01 02:00');
select tx.id as "ID1", l.id as "ID2", tx.usr as "User",
tx.ts as "Start timestamp", l.ts as "End timestamp"
from the_table as tx
left join lateral
(
select ti.id, ti.ts
from the_table as ti
where ti.action = 'END'
and ti.ts > tx.ts
and ti.usr = tx.usr
order by ti.ts - tx.ts
limit 1
) as l on true
where tx.action = 'START'
order by "Start timestamp";
CodePudding user response:
You can use the window function Lead.
with Daten
as
(
Select 12 as ID, 'user1' as Benutzer, 'END' as action, '05:00' as Time
Union
Select 43 as ID, 'user1' as Benutzer, 'Start' as action, '04:00' as Time
Union
Select 54 as ID, 'user1' as Benutzer, 'END' as action, '03:00' as Time
Union
Select 13 as ID, 'user1' as Benutzer, 'Start' as action, '02:00' as Time
)
Select
*
from
(
Select
*,
lead(ID,1) over (order by number) as ID2,
lead(action,1) over (order by number) as action2,
lead(time,1) over (order by number) as time2
from
(
Select
*,
ROW_NUMBER() OVER(ORDER BY Benutzer,Time,action) as number
from
Daten
) x
) y
where y.action = 'Start'
CodePudding user response:
The issue with your query above is that for each start event, there can be multiple end events, which occur after. However, you would like to choose the one that's 'closest' to the start event. You can achieve this by adding an additional aggregation.
Please find a HANA example (uses no HANA specific functionality):
SELECT
S.ID ID1,
S.USER,
S.ACTION,
S.TIMESTAMP START_TIME,
MIN(E.TIMESTAMP) END_TIME
FROM TEST S
JOIN TEST E ON (
s.USER = e.USER AND
s.ACTION = 'START' AND
e.ACTION = 'END' AND
e.TIMESTAMP >= s.TIMESTAMP
)
GROUP BY S.ID, S.ACTION, S.USER, S.TIMESTAMP
If you need to have E.ID
included, you will need to join it back to the result set. Note, that there may be multiple end events with the same timestamp, which you need to handle when joining back E.ID
.
If you additionally would like to include START
events without corresponding END
event, you can use the following:
INSERT INTO TEST VALUES (13, 'user1', 'START', '2022-01-01 09:00')
SELECT
S.ID ID1,
S.USER,
S.ACTION,
S.TIMESTAMP START_TIME,
MIN(E.TIMESTAMP) END_TIME
FROM TEST S
LEFT JOIN TEST E ON (
s.USER = e.USER AND
e.ACTION = 'END' AND
e.TIMESTAMP >= s.TIMESTAMP
)
WHERE s.ACTION ='START'
GROUP BY S.ID, S.ACTION, S.USER, S.TIMESTAMP
CodePudding user response:
One way is a lateral join that picks the smallest "end" timestamp that is greater than the "start" timestamp:
select st.id as id1,
en.id as id2,
st."timestamp" as start_timestamp,
en."timestamp" as end_timestamp
from events st
left join lateral (
select id, "timestamp"
from events e
where e."user" = st."user"
and e.action = 'END'
and e.timestamp >= st.timestamp
order by "timestamp"
fetch first 1 row only
) en on true
where st.action = 'START';
The above is standard ANSI SQL and works (at least) in Postgres.
In Postgres I would create an index on events ("user", "timestamp") where action = 'END'
to make the lateral query fast.