I have a dataset like the above. Basically it's like a website session which I have a start end activity, and in between there could be other unrelated records.
I need to pull the records based on:
- I need to report all the "end" activity from the table
- In each "end" record, I need to find its nearest previous "start" row's "Content_ID" value. The record must also match the "Profile_ID" as well
- The result I expected is as follows:
May I ask how could I construct the SQL to do this?
Thanks so much!
CodePudding user response:
You might consider below in BigQuery.
SELECT * EXCEPT(Content_ID),
LAST_VALUE(IF(Activity = 'start', Content_ID, NULL) IGNORE NULLS) OVER w AS Content_ID
FROM sample_table
QUALIFY Activity = 'end'
WINDOW w AS (PARTITION BY Profile_ID ORDER BY Created_Time);
Query results
CodePudding user response:
We can address this as a gaps-and-island problem, where an island begins every time an activity start
s for a given profile_id
: we can define the groups with a window sum.
Then, for every end activity, we need to report the content id of the beginning of the island.
select profile_id, created_time,
max(case when activity = 'start' then content_id end)
over(partition by profile_id, grp) content_id
from (
select t.*,
sum(case when activity = 'start' then 1 else 0 end)
over(partition by profile_id order by created_time) grp
from mytable t
) t
where activity = 'end'
Depending on the database you actually use (BQ or MySQL), some syntax shortcuts exists to shorten the conditional expressions. In BQ for example, we would phrase the window sum with countif
:
countif(activity = 'start')
over(partition by profile_id order by created_time) grp