How to use lead() efficiently and return next different value?-CodePudding

I have a table that records user transactions like this (simplified version) in BigQuery

user	transaction_date	label	cost
a	2021-10-31 10:30:00	y1	10
b	2021-10-31 10:30:00	y2	10
c	2021-10-31 10:30:00	y1	10
a	2021-11-31 10:30:00	y1	10
a	2021-12-31 10:30:00	y2	10
b	2021-11-31 10:30:00	y3	10
c	2021-11-31 10:30:00	y1	10
b	2021-12-31 10:30:00	y2	10
c	2021-12-31 10:30:00	y1	10

I am interested in information related to cost and current/next label. How can I use LEAD() to return the next different label as label_next?

i.e

user	transaction_date	label	cost	label_next
a	2021-10-31 10:30:00	y1	10	y2
b	2021-10-31 10:30:00	y2	10	y3
c	2021-10-31 10:30:00	y1	10	y3
a	2021-11-31 10:30:00	y1	10	y2
a	2021-12-31 10:30:00	y2	10	y5
b	2021-11-31 10:30:00	y3	10	y2
c	2021-11-31 10:30:00	y1	10	y3
b	2021-12-31 10:30:00	y2	10	null
c	2021-12-31 10:30:00	y3	10	null
a	2021-12-31 18:30:00	y5	10	null

standard LEAD() would return just the next label i.e. for row 1 (user a) would return y1 as the same user is registered with the same label once before being seen with a different label.

I think I have one solution that involves: grouping by user, label and calculate min and max transaction_date then use LEAD() to get label_next per grouping and join that table on the initial table on user and transaction_date inside min and max transaction_date

But is there a way to do it differently?

CodePudding user response：

We can use a conditional aggregation approach here:

WITH cte AS (
    SELECT *, COUNT(CASE WHEN LEAD(label) OVER (PARTITION BY user
                                                ORDER BY transaction_date) != label
                         THEN 1 END) AS cnt
    FROM yourTable
),
cte2 AS (
    SELECT DISTINCT user, cnt, label
    FROM cte
)

SELECT t1.user, t1.transaction_date, t1.label, t2.label AS label_next
FROM cte t1
LEFT JOIN cte2 t2
    ON t2.user = t1.user AND
       t2.cnt = t1.cnt   1
ORDER BY t1.user, t1.transaction_date;

The idea here is to generate a running count, across each record for each user, which increments by 1 every time the label changes. We use this result to generate (in cte2) an intermediate table where, given a count value and user, we can lookup the next different label. The final left join is where this lookup is taking place.

CodePudding user response：

Another simple approach we can take that doesn't involve lead function is to find the next label for a user where the label value is not matching current label and the transaction date is greater than current date.

    with temp as (select 'a' user, '2021-10-31 10:30:00' as transaction_date, 'y1' as label, '10' as cost
union all
select 'a' user, '2021-11-31 10:30:00' as transaction_date, 'y1' as label, '10' as cost
union all
select 'a' user, '2021-12-31 10:30:00   ' as transaction_date, 'y2' as label, '10' as cost
union all
select 'a' user, '2021-12-31 20:30:00   ' as transaction_date, 'y5' as label, '10' as cost
union all
select 'b' user, '2021-10-31 10:30:00' as transaction_date, 'y2' as label, '10' as cost
union all
select 'b' user, '2021-11-31 10:30:00' as transaction_date, 'y3' as label, '10' as cost
union all
select 'b' user, '2021-12-31 10:30:00   ' as transaction_date, 'y2' as label, '10' as cost
union all
select 'c' user, '2021-10-31 10:30:00' as transaction_date, 'y1' as label, '10' as cost
union all
select 'c' user, '2021-11-31 10:30:00' as transaction_date, 'y1' as label, '10' as cost
union all
select 'c' user, '2021-12-31 10:30:00   ' as transaction_date, 'y3' as label, '10' as cost
union all
select 'c' user, '2021-12-31 20:30:00   ' as transaction_date, 'y4' as label, '10' as cost
)
select user, transaction_date, label, cost, (select min(label) from temp b where a.label <> b.label and
 a.user = b.user and b.transaction_date>a.transaction_date
 ) as label_next
from temp a