Home > Software engineering >  Snowflake SQL hierarchical time between events
Snowflake SQL hierarchical time between events

Time:10-11

I'm working with a product hierarchy where one or more products belong to a product group, and one or more product groups belong to a product family:

dim_product

pid fam grp prd
1 A A1 A1a
2 A A1 A1b
3 A A2 A2a
4 B B1 B1a

I have a time series of user activity, where each row represents a user generating revenue for a given product on a given date:

fact_user_activity

date uid pid
220101 101 1
220101 747 1
220101 101 2
220102 101 3
220102 747 4
220103 101 3

I need to produce a table such that; for each entry in fact_user_activity, I denormalize the product hierarchy (family, type, name) and provide the elapsed time since:

  1. The user's previous activity
  2. The user's previous activity with the product family
  3. The user's previous activity with the product type
  4. The user's previous activity with the product

For the above example, the table I need to build would look like this:

fact_user_activity_delta (td = time delta in days)

date uid pid fam grp prd td_act td_fam td_grp td_prd
220101 101 1 A A1 A1a Null Null Null Null
220101 747 1 A A1 A1a Null Null Null Null
220101 101 2 A A1 A1b Null Null Null Null
220102 101 3 A A2 A2a 1 1 Null Null
220102 747 4 B B1 B1a 1 Null Null Null
220103 101 3 A A2 A2a 1 1 1 1

My approach was first to rank the entries and track the first activity date for each hierarchy:

CREATE VIEW ranked_user_activity_vw AS
  SELECT *,
    DENSE_RANK() OVER (PARTITION BY uid ORDER BY date DESC) act_rank,
    FIRST_VALUE(date) OVER (PARTITION BY uid ORDER BY date ASC) first_act,
    DENSE_RANK() OVER (PARTITION BY uid, fam ORDER BY date DESC) fam_rank,
    FIRST_VALUE(date) OVER (PARTITION BY uid, fam ORDER BY date ASC) first_fam,
    DENSE_RANK() OVER (PARTITION BY uid, fam, grp ORDER BY date DESC) grp_rank,
    FIRST_VALUE(date) OVER (PARTITION BY uid, fam, grp ORDER BY date ASC) first_grp,
    DENSE_RANK() OVER (PARTITION BY uid, fam, grp, prd ORDER BY date DESC) prd_rank,
    FIRST_VALUE(date) OVER (PARTITION BY uid, fam, grp, prd ORDER BY date ASC) first_prd
  FROM fact_user_activity 
    NATURAL JOIN dim_product 

Then I tried to get the time deltas by self-joining ranked_user_activity:

CREATE TABLE fact_user_activity_delta AS 
  SELECT 
    b.date, b.uid, b.pid, b.fam, b.grp, b.prd,
    IFF(b.date = b.first_act, NULL, b.date - _any.date) AS td_any,
    IFF(b.date = b.first_fam, NULL, b.date - _fam.date) AS td_fam,
    IFF(b.date = b.first_grp, NULL, b.date - _grp.date) AS td_grp,
    IFF(b.date = b.first_prd, NULL, b.date - _prd.date) AS td_prd
  FROM ranked_user_activity_vw AS b 
    LEFT JOIN ranked_user_activity_vw AS _act ON
      b.date != b.first_act AND
      b.uid = _act.uid
    LEFT JOIN ranked_user_activity_vw AS _fam ON
      b.date != b.first_fam AND
      _act.uid = _fam.uid AND
      _act.fam = _fam.fam
    LEFT JOIN ranked_user_activity_vw AS _grp ON
      b.date != b.first_grp AND
      _fam.uid = _grp.uid AND
      _fam.fam = _grp.fam AND
      _fam.grp = _grp.grp
    LEFT JOIN ranked_user_activity_vw AS _prd ON
      b.date != b.first_prd AND
      _grp.uid = _prd.uid AND
      _grp.pid = _prd.pid
  WHERE
    (b.act_rank = b.first_act OR base.act_rank   1 = _act.act_rank) AND
    (b.fam_rank = b.first_fam OR base.fam_rank   1 = _fam.fam_rank) AND
    (b.grp_rank = b.first_grp OR base.grp_rank   1 = _grp.grp_rank) AND
    (b.prd_rank = b.first_prd OR base.prd_rank   1 = _prd.prd_rank);

This seems to work, but it goes against a lot of advice in the Snowflake docs:

  1. The window functions have different partitions.
  2. Snowflake offers
  3. I'm pretty new to Snowflake in general and wonder if my solution is particularly inefficient.

Thank you!

CodePudding user response:

so using these CTE's for data:

with dim_product(pid, fam, grp, prd) as (
    select * from values
    (1, 'A', 'A1', 'A1a'),
    (2, 'A', 'A1', 'A1b'),
    (3, 'A', 'A2', 'A2a'),
    (4, 'B', 'B1', 'B1a')
), fact_user_activity(date, uid,    pid) as (
    select * from values
    (220101, 101, 1),
    (220101, 747, 1),
    (220101, 101, 2),
    (220102, 101, 3),
    (220102, 747, 4),
    (220103, 101, 3)
)

we can solve this with LAG, and using NULLIFZERO to remove the zero day gaps.. but this aspect is messy when you have many same day transactions, could could been seen with a different sort order and/or different data.

this version shows the partial results:

select 
    *
    ,lag(date)ignore nulls over(partition by uid order by date) as ld
    ,nullifzero(date - ld) as td_act
    ,lag(date)ignore nulls over(partition by uid, fam order by date) as ldf
    ,nullifzero(date - ldf) as td_fam
    ,lag(date)ignore nulls over(partition by uid, fam, grp order by date) as ldfg
    ,nullifzero(date - ldfg) as td_grp
    ,lag(date)ignore nulls over(partition by uid, fam, grp, prd order by date) as ldfgp
    ,nullifzero(date - ldfgp) as td_prd
from fact_user_activity
natural JOIN dim_product
order by date, pid
PID DATE UID FAM GRP PRD LD TD_ACT LDF TD_FAM LDFG TD_GRP LDFGP TD_PRD
1 220,101 101 A A1 A1a
1 220,101 747 A A1 A1a
2 220,101 101 A A1 A1b 220,101 220,101 220,101
3 220,102 101 A A2 A2a 220,101 1 220,101 1
4 220,102 747 B B1 B1a 220,101 1
3 220,103 101 A A2 A2a 220,102 1 220,102 1 220,102 1 220,102 1

but this can be inlined once you see how it works:

select 
    *
    ,nullifzero(date - lag(date)ignore nulls over(partition by uid order by date) ) as td_act
    ,nullifzero(date - lag(date)ignore nulls over(partition by uid, fam order by date) ) as td_fam
    ,nullifzero(date - lag(date)ignore nulls over(partition by uid, fam, grp order by date)) as td_grp
    ,nullifzero(date - lag(date)ignore nulls over(partition by uid, fam, grp, prd order by date)) as td_prd
from fact_user_activity
natural JOIN dim_product
order by date, pid
PID DATE UID FAM GRP PRD TD_ACT TD_FAM TD_GRP TD_PRD
1 220,101 101 A A1 A1a
1 220,101 747 A A1 A1a
2 220,101 101 A A1 A1b
3 220,102 101 A A2 A2a 1 1
4 220,102 747 B B1 B1a 1
3 220,103 101 A A2 A2a 1 1 1 1
  • Related