Home > Back-end >  How to use lead() efficiently and return next different value?
How to use lead() efficiently and return next different value?

Time:10-31

I have a table that records user transactions like this (simplified version) in BigQuery

user transaction_date label cost
a 2021-10-31 10:30:00 y1 10
b 2021-10-31 10:30:00 y2 10
c 2021-10-31 10:30:00 y1 10
a 2021-11-31 10:30:00 y1 10
a 2021-12-31 10:30:00 y2 10
b 2021-11-31 10:30:00 y3 10
c 2021-11-31 10:30:00 y1 10
b 2021-12-31 10:30:00 y2 10
c 2021-12-31 10:30:00 y1 10

I am interested in information related to cost and current/next label. How can I use LEAD() to return the next different label as label_next?

i.e

user transaction_date label cost label_next
a 2021-10-31 10:30:00 y1 10 y2
b 2021-10-31 10:30:00 y2 10 y3
c 2021-10-31 10:30:00 y1 10 y3
a 2021-11-31 10:30:00 y1 10 y2
a 2021-12-31 10:30:00 y2 10 y5
b 2021-11-31 10:30:00 y3 10 y2
c 2021-11-31 10:30:00 y1 10 y3
b 2021-12-31 10:30:00 y2 10 null
c 2021-12-31 10:30:00 y3 10 null
a 2021-12-31 18:30:00 y5 10 null

standard LEAD() would return just the next label i.e. for row 1 (user a) would return y1 as the same user is registered with the same label once before being seen with a different label.

I think I have one solution that involves: grouping by user, label and calculate min and max transaction_date then use LEAD() to get label_next per grouping and join that table on the initial table on user and transaction_date inside min and max transaction_date

But is there a way to do it differently?

CodePudding user response:

We can use a conditional aggregation approach here:

WITH cte AS (
    SELECT *, COUNT(CASE WHEN LEAD(label) OVER (PARTITION BY user
                                                ORDER BY transaction_date) != label
                         THEN 1 END) AS cnt
    FROM yourTable
),
cte2 AS (
    SELECT DISTINCT user, cnt, label
    FROM cte
)

SELECT t1.user, t1.transaction_date, t1.label, t2.label AS label_next
FROM cte t1
LEFT JOIN cte2 t2
    ON t2.user = t1.user AND
       t2.cnt = t1.cnt   1
ORDER BY t1.user, t1.transaction_date;

The idea here is to generate a running count, across each record for each user, which increments by 1 every time the label changes. We use this result to generate (in cte2) an intermediate table where, given a count value and user, we can lookup the next different label. The final left join is where this lookup is taking place.

CodePudding user response:

Another simple approach we can take that doesn't involve lead function is to find the next label for a user where the label value is not matching current label and the transaction date is greater than current date.

    with temp as (select 'a' user, '2021-10-31 10:30:00' as transaction_date, 'y1' as label, '10' as cost
union all
select 'a' user, '2021-11-31 10:30:00' as transaction_date, 'y1' as label, '10' as cost
union all
select 'a' user, '2021-12-31 10:30:00   ' as transaction_date, 'y2' as label, '10' as cost
union all
select 'a' user, '2021-12-31 20:30:00   ' as transaction_date, 'y5' as label, '10' as cost
union all
select 'b' user, '2021-10-31 10:30:00' as transaction_date, 'y2' as label, '10' as cost
union all
select 'b' user, '2021-11-31 10:30:00' as transaction_date, 'y3' as label, '10' as cost
union all
select 'b' user, '2021-12-31 10:30:00   ' as transaction_date, 'y2' as label, '10' as cost
union all
select 'c' user, '2021-10-31 10:30:00' as transaction_date, 'y1' as label, '10' as cost
union all
select 'c' user, '2021-11-31 10:30:00' as transaction_date, 'y1' as label, '10' as cost
union all
select 'c' user, '2021-12-31 10:30:00   ' as transaction_date, 'y3' as label, '10' as cost
union all
select 'c' user, '2021-12-31 20:30:00   ' as transaction_date, 'y4' as label, '10' as cost
)
select user, transaction_date, label, cost, (select min(label) from temp b where a.label <> b.label and
 a.user = b.user and b.transaction_date>a.transaction_date
 ) as label_next
from temp a
  • Related