I have a table that records user transactions like this (simplified version) in BigQuery
user | transaction_date | label | cost |
---|---|---|---|
a | 2021-10-31 10:30:00 | y1 | 10 |
b | 2021-10-31 10:30:00 | y2 | 10 |
c | 2021-10-31 10:30:00 | y1 | 10 |
a | 2021-11-31 10:30:00 | y1 | 10 |
a | 2021-12-31 10:30:00 | y2 | 10 |
b | 2021-11-31 10:30:00 | y3 | 10 |
c | 2021-11-31 10:30:00 | y1 | 10 |
b | 2021-12-31 10:30:00 | y2 | 10 |
c | 2021-12-31 10:30:00 | y1 | 10 |
I am interested in information related to cost and current/next label. How can I use LEAD() to return the next different label as label_next?
i.e
user | transaction_date | label | cost | label_next |
---|---|---|---|---|
a | 2021-10-31 10:30:00 | y1 | 10 | y2 |
b | 2021-10-31 10:30:00 | y2 | 10 | y3 |
c | 2021-10-31 10:30:00 | y1 | 10 | y3 |
a | 2021-11-31 10:30:00 | y1 | 10 | y2 |
a | 2021-12-31 10:30:00 | y2 | 10 | y5 |
b | 2021-11-31 10:30:00 | y3 | 10 | y2 |
c | 2021-11-31 10:30:00 | y1 | 10 | y3 |
b | 2021-12-31 10:30:00 | y2 | 10 | null |
c | 2021-12-31 10:30:00 | y3 | 10 | null |
a | 2021-12-31 18:30:00 | y5 | 10 | null |
standard LEAD() would return just the next label i.e. for row 1 (user a) would return y1 as the same user is registered with the same label once before being seen with a different label.
I think I have one solution that involves: grouping by user, label and calculate min and max transaction_date then use LEAD() to get label_next per grouping and join that table on the initial table on user and transaction_date inside min and max transaction_date
But is there a way to do it differently?
CodePudding user response:
We can use a conditional aggregation approach here:
WITH cte AS (
SELECT *, COUNT(CASE WHEN LEAD(label) OVER (PARTITION BY user
ORDER BY transaction_date) != label
THEN 1 END) AS cnt
FROM yourTable
),
cte2 AS (
SELECT DISTINCT user, cnt, label
FROM cte
)
SELECT t1.user, t1.transaction_date, t1.label, t2.label AS label_next
FROM cte t1
LEFT JOIN cte2 t2
ON t2.user = t1.user AND
t2.cnt = t1.cnt 1
ORDER BY t1.user, t1.transaction_date;
The idea here is to generate a running count, across each record for each user, which increments by 1 every time the label changes. We use this result to generate (in cte2
) an intermediate table where, given a count value and user, we can lookup the next different label. The final left join is where this lookup is taking place.
CodePudding user response:
Another simple approach we can take that doesn't involve lead function is to find the next label for a user where the label value is not matching current label and the transaction date is greater than current date.
with temp as (select 'a' user, '2021-10-31 10:30:00' as transaction_date, 'y1' as label, '10' as cost
union all
select 'a' user, '2021-11-31 10:30:00' as transaction_date, 'y1' as label, '10' as cost
union all
select 'a' user, '2021-12-31 10:30:00 ' as transaction_date, 'y2' as label, '10' as cost
union all
select 'a' user, '2021-12-31 20:30:00 ' as transaction_date, 'y5' as label, '10' as cost
union all
select 'b' user, '2021-10-31 10:30:00' as transaction_date, 'y2' as label, '10' as cost
union all
select 'b' user, '2021-11-31 10:30:00' as transaction_date, 'y3' as label, '10' as cost
union all
select 'b' user, '2021-12-31 10:30:00 ' as transaction_date, 'y2' as label, '10' as cost
union all
select 'c' user, '2021-10-31 10:30:00' as transaction_date, 'y1' as label, '10' as cost
union all
select 'c' user, '2021-11-31 10:30:00' as transaction_date, 'y1' as label, '10' as cost
union all
select 'c' user, '2021-12-31 10:30:00 ' as transaction_date, 'y3' as label, '10' as cost
union all
select 'c' user, '2021-12-31 20:30:00 ' as transaction_date, 'y4' as label, '10' as cost
)
select user, transaction_date, label, cost, (select min(label) from temp b where a.label <> b.label and
a.user = b.user and b.transaction_date>a.transaction_date
) as label_next
from temp a