I'm working with a product hierarchy where one or more products belong to a product group, and one or more product groups belong to a product family:
dim_product
pid | fam | grp | prd |
---|---|---|---|
1 | A | A1 | A1a |
2 | A | A1 | A1b |
3 | A | A2 | A2a |
4 | B | B1 | B1a |
I have a time series of user activity, where each row represents a user generating revenue for a given product on a given date:
fact_user_activity
date | uid | pid |
---|---|---|
220101 | 101 | 1 |
220101 | 747 | 1 |
220101 | 101 | 2 |
220102 | 101 | 3 |
220102 | 747 | 4 |
220103 | 101 | 3 |
I need to produce a table such that; for each entry in fact_user_activity
, I denormalize the product hierarchy (family, type, name) and provide the elapsed time since:
- The user's previous activity
- The user's previous activity with the product family
- The user's previous activity with the product type
- The user's previous activity with the product
For the above example, the table I need to build would look like this:
fact_user_activity_delta
(td = time delta in days)
date | uid | pid | fam | grp | prd | td_act | td_fam | td_grp | td_prd |
---|---|---|---|---|---|---|---|---|---|
220101 | 101 | 1 | A | A1 | A1a | Null | Null | Null | Null |
220101 | 747 | 1 | A | A1 | A1a | Null | Null | Null | Null |
220101 | 101 | 2 | A | A1 | A1b | Null | Null | Null | Null |
220102 | 101 | 3 | A | A2 | A2a | 1 | 1 | Null | Null |
220102 | 747 | 4 | B | B1 | B1a | 1 | Null | Null | Null |
220103 | 101 | 3 | A | A2 | A2a | 1 | 1 | 1 | 1 |
My approach was first to rank the entries and track the first activity date for each hierarchy:
CREATE VIEW ranked_user_activity_vw AS
SELECT *,
DENSE_RANK() OVER (PARTITION BY uid ORDER BY date DESC) act_rank,
FIRST_VALUE(date) OVER (PARTITION BY uid ORDER BY date ASC) first_act,
DENSE_RANK() OVER (PARTITION BY uid, fam ORDER BY date DESC) fam_rank,
FIRST_VALUE(date) OVER (PARTITION BY uid, fam ORDER BY date ASC) first_fam,
DENSE_RANK() OVER (PARTITION BY uid, fam, grp ORDER BY date DESC) grp_rank,
FIRST_VALUE(date) OVER (PARTITION BY uid, fam, grp ORDER BY date ASC) first_grp,
DENSE_RANK() OVER (PARTITION BY uid, fam, grp, prd ORDER BY date DESC) prd_rank,
FIRST_VALUE(date) OVER (PARTITION BY uid, fam, grp, prd ORDER BY date ASC) first_prd
FROM fact_user_activity
NATURAL JOIN dim_product
Then I tried to get the time deltas by self-joining ranked_user_activity:
CREATE TABLE fact_user_activity_delta AS
SELECT
b.date, b.uid, b.pid, b.fam, b.grp, b.prd,
IFF(b.date = b.first_act, NULL, b.date - _any.date) AS td_any,
IFF(b.date = b.first_fam, NULL, b.date - _fam.date) AS td_fam,
IFF(b.date = b.first_grp, NULL, b.date - _grp.date) AS td_grp,
IFF(b.date = b.first_prd, NULL, b.date - _prd.date) AS td_prd
FROM ranked_user_activity_vw AS b
LEFT JOIN ranked_user_activity_vw AS _act ON
b.date != b.first_act AND
b.uid = _act.uid
LEFT JOIN ranked_user_activity_vw AS _fam ON
b.date != b.first_fam AND
_act.uid = _fam.uid AND
_act.fam = _fam.fam
LEFT JOIN ranked_user_activity_vw AS _grp ON
b.date != b.first_grp AND
_fam.uid = _grp.uid AND
_fam.fam = _grp.fam AND
_fam.grp = _grp.grp
LEFT JOIN ranked_user_activity_vw AS _prd ON
b.date != b.first_prd AND
_grp.uid = _prd.uid AND
_grp.pid = _prd.pid
WHERE
(b.act_rank = b.first_act OR base.act_rank 1 = _act.act_rank) AND
(b.fam_rank = b.first_fam OR base.fam_rank 1 = _fam.fam_rank) AND
(b.grp_rank = b.first_grp OR base.grp_rank 1 = _grp.grp_rank) AND
(b.prd_rank = b.first_prd OR base.prd_rank 1 = _prd.prd_rank);
This seems to work, but it goes against a lot of advice in the Snowflake docs:
- The window functions have different partitions.
- Snowflake offers
- I'm pretty new to Snowflake in general and wonder if my solution is particularly inefficient.
Thank you!
CodePudding user response:
so using these CTE's for data:
with dim_product(pid, fam, grp, prd) as (
select * from values
(1, 'A', 'A1', 'A1a'),
(2, 'A', 'A1', 'A1b'),
(3, 'A', 'A2', 'A2a'),
(4, 'B', 'B1', 'B1a')
), fact_user_activity(date, uid, pid) as (
select * from values
(220101, 101, 1),
(220101, 747, 1),
(220101, 101, 2),
(220102, 101, 3),
(220102, 747, 4),
(220103, 101, 3)
)
we can solve this with LAG, and using NULLIFZERO to remove the zero day gaps.. but this aspect is messy when you have many same day transactions, could could been seen with a different sort order and/or different data.
this version shows the partial results:
select
*
,lag(date)ignore nulls over(partition by uid order by date) as ld
,nullifzero(date - ld) as td_act
,lag(date)ignore nulls over(partition by uid, fam order by date) as ldf
,nullifzero(date - ldf) as td_fam
,lag(date)ignore nulls over(partition by uid, fam, grp order by date) as ldfg
,nullifzero(date - ldfg) as td_grp
,lag(date)ignore nulls over(partition by uid, fam, grp, prd order by date) as ldfgp
,nullifzero(date - ldfgp) as td_prd
from fact_user_activity
natural JOIN dim_product
order by date, pid
PID | DATE | UID | FAM | GRP | PRD | LD | TD_ACT | LDF | TD_FAM | LDFG | TD_GRP | LDFGP | TD_PRD |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|
1 | 220,101 | 101 | A | A1 | A1a | ||||||||
1 | 220,101 | 747 | A | A1 | A1a | ||||||||
2 | 220,101 | 101 | A | A1 | A1b | 220,101 | 220,101 | 220,101 | |||||
3 | 220,102 | 101 | A | A2 | A2a | 220,101 | 1 | 220,101 | 1 | ||||
4 | 220,102 | 747 | B | B1 | B1a | 220,101 | 1 | ||||||
3 | 220,103 | 101 | A | A2 | A2a | 220,102 | 1 | 220,102 | 1 | 220,102 | 1 | 220,102 | 1 |
but this can be inlined once you see how it works:
select
*
,nullifzero(date - lag(date)ignore nulls over(partition by uid order by date) ) as td_act
,nullifzero(date - lag(date)ignore nulls over(partition by uid, fam order by date) ) as td_fam
,nullifzero(date - lag(date)ignore nulls over(partition by uid, fam, grp order by date)) as td_grp
,nullifzero(date - lag(date)ignore nulls over(partition by uid, fam, grp, prd order by date)) as td_prd
from fact_user_activity
natural JOIN dim_product
order by date, pid
PID | DATE | UID | FAM | GRP | PRD | TD_ACT | TD_FAM | TD_GRP | TD_PRD |
---|---|---|---|---|---|---|---|---|---|
1 | 220,101 | 101 | A | A1 | A1a | ||||
1 | 220,101 | 747 | A | A1 | A1a | ||||
2 | 220,101 | 101 | A | A1 | A1b | ||||
3 | 220,102 | 101 | A | A2 | A2a | 1 | 1 | ||
4 | 220,102 | 747 | B | B1 | B1a | 1 | |||
3 | 220,103 | 101 | A | A2 | A2a | 1 | 1 | 1 | 1 |