I have a dataset where there is an id
which has steps. Each step has a timestamp and the channel name. A channel can repeat several times at multiple timestamps for a given id.
I am trying to measure for each chunk of the repeating channels (ordered by the timestamp), how many occurrences happen?
Here is my sample data -
with temp as (
select 1 as id, '2019-08-02 13:13:27 UTC' as t_date, 'email' as channel union all
select 1 as id, '2019-08-02 13:14:27 UTC' as t_date, 'email' as channel union all
select 1 as id, '2019-08-02 13:15:27 UTC' as t_date, 'display' as channel union all
select 1 as id, '2019-08-02 13:16:27 UTC' as t_date, 'display' as channel union all
select 1 as id, '2019-08-02 13:17:27 UTC' as t_date, 'email' as channel union all
select 1 as id, '2019-08-02 13:18:27 UTC' as t_date, 'email' as channel union all
select 2 as id, '2019-08-02 13:11:27 UTC' as t_date, 'email' as channel union all
select 2 as id, '2019-08-02 13:12:27 UTC' as t_date, 'email' as channel union all
select 2 as id, '2019-08-02 13:13:27 UTC' as t_date, 'email' as channel union all
select 2 as id, '2019-08-02 13:14:27 UTC' as t_date, 'email' as channel
)
select id, channel , count(1) appearances
from temp
group by id , channel
order by id
However, I need something like this -
As shown in the output, for each sequences of the channels occurring together, I need to count the appearances
and the start and end times. For instance, the first record in the output belongs to email
channel that started for id = 1 at 2019-08-02 13:13:27 UTC
and ended at 2019-08-02 13:14:27 UTC
- ordered by timestamp. The last column shows many how many times email
channel repeats before it changes to next one (display in this case).
How can I achieve this in BigQuery?
CodePudding user response:
Consider below approach
select id, channel,
min(t_date) as start_date,
max(t_date) as end_date,
count(1) as appearances
from (
select *, countif(new_group) over (partition by id order by t_date) group_id
from (
select *, ifnull(channel != lag(channel) over win, true) new_group
from temp
window win as (partition by id order by t_date)
)
)
group by id, channel, group_id
if applied to sample data in your question - output is