I have a situation where I'm trying to get data to populate a Sankey graph.
I have the data with timestamp and the location of the person when captured by the system.
The normal case is when a person location changes and in that case, "from" should be that location and "to" should be that person's next entry provided it's less than 2h difference from "from".
Original data:
date, macAdress, zone
8h10m, 00-B0-D0-63-C2-26, room1
8h12m, 00-B0-D0-63-C2-26, hall
8h15m, 00-A0-B0-23-T2-22, room1
8h16m, 00-A0-B0-23-T2-22, meeting2
8h18m, 00-B0-D0-63-C2-26, meeting2
8h25m, 00-A0-B0-23-T2-22, cafetaria
8h30m, 00-G5-A8-44-T2-30, room1
8h34m, 00-G5-A8-44-T2-30, meeting2
14h05m, 00-G5-A8-44-T2-30, cafetaria
Result required in the following way (or similar):
from, to, count
room1, hall, 1
hall, meeting2, 2
room1, meeting2, 1
meeting2, cafetaria, 1 <-- special case as time from last zone is longer than 2h it didn't count "14h05m, 00-G5-A8-44-T2-30, cafetaria"
Can you give me an idea of how can I get such result in SQL? I'm using BigQuery, but I believe standard SQL should do the job.
Thanks,
Diogo
CodePudding user response:
Consider below
with temp as (
select *, parse_time('%Hh%Mm', date) time
from your_table
), from_to as (
select
zone as _from, lead(zone) over win as _to,
time_diff(lead(time) over win, time, minute) as duration
from temp
window win as (partition by macAdress order by time)
)
select _from, _to, count(*) _count
from from_to
where not _to is null
and duration < 120
group by _from, _to
if applied to sample data as in your question
output is