Home > front end >  row_number() expression question in SQL Presto
row_number() expression question in SQL Presto

Time:12-01

Sample table:

object_id event_time event_type event_subtype stage
1 2022-10-01 create name, stage A
1 2022-10-02 update stage B
1 2022-10-03 update stage C
1 2022-10-04 update stage A
2 2022-10-01 create name, stage A
2 2022-10-02 update stage C
2 2022-10-03 update stage A
2 2022-10-04 update stage B
2 2022-10-05 update stage C
2 2022-10-06 update stage A

So what I need is a column that numbers the rows based on the stage - after an object_id reaches stage C, the row number of the same object_id should be incremented. It'd look like this:

object_id event_time event_type event_subtype stage row_number
1 2022-10-01 create name, stage A 1
1 2022-10-02 update stage B 1
1 2022-10-03 update stage C 1
1 2022-10-04 update stage A 2
2 2022-10-01 create name, stage A 1
2 2022-10-02 update stage C 1
2 2022-10-03 update stage A 2
2 2022-10-04 update stage B 2
2 2022-10-05 update stage C 2
2 2022-10-06 update stage A 3

The table must be ordered by object_id, event_time. I'm having troublle writing the window function that does this, this is what I've tried:

row_number() over (partition by object_id, stage order by event_time)

It just doesn't work for all cases. Also I'm having a hard time understanding how this would work when I'm not defining stage = C as the delimiter anywhere. Any ideas?

Thanks!

CodePudding user response:

I would suggest using rolling sum based on previous value of stage:

-- sample data
with dataset(object_id, event_time, event_type, event_subtype, stage) as (
    values    (1, '2022-10-01', 'create',   'name, stage', 'A'),
    (1, '2022-10-02', 'update', 'stage', 'B'),
    (1, '2022-10-03', 'update', 'stage', 'C'),
    (1, '2022-10-04', 'update', 'stage', 'A'),
    (2, '2022-10-01', 'create', 'name, stage',' A'),
    (2, '2022-10-02', 'update', 'stage', 'C'),
    (2, '2022-10-03', 'update', 'stage', 'A'),
    (2, '2022-10-04', 'update', 'stage', 'B'),
    (2, '2022-10-05', 'update', 'stage', 'C'),
    (2, '2022-10-06', 'update', 'stage', 'A')
)

-- query
select object_id,
       event_time,
       event_type,
       event_subtype,
       stage,
       1   sum(counter) over (partition by object_id order by event_time) as num
from (select *,
             if(lag(stage) over (partition by object_id order by event_time) = 'C', 1, 0) counter
      from dataset);

Output:

object_id event_time event_type event_subtype stage num
1 2022-10-01 create name, stage A 1
1 2022-10-02 update stage B 1
1 2022-10-03 update stage C 1
1 2022-10-04 update stage A 2
2 2022-10-01 create name, stage A 1
2 2022-10-02 update stage C 1
2 2022-10-03 update stage A 2
2 2022-10-04 update stage B 2
2 2022-10-05 update stage C 2
2 2022-10-06 update stage A 3

CodePudding user response:

It is not the problem of the row_number to give you the final "correct" order.

for you mist use the ORDER BY

SELECT
"object_id", "event_time", "event_type", "event_subtype", "stage",
ROW_NUMBER() OVER(PARTITION BY "object_id","stage" ORDER BY "event_time") rn
  FROM tab1
  ORDER BY "object_id",rn,"stage"
object_id event_time event_type event_subtype stage rn
1 2022-10-01 create name, stage A 1
1 2022-10-02 update stage B 1
1 2022-10-03 update stage C 1
1 2022-10-04 update stage A 2
2 2022-10-01 create name, stage A 1
2 2022-10-04 update stage B 1
2 2022-10-02 update stage C 1
2 2022-10-03 update stage A 2
2 2022-10-05 update stage C 2
2 2022-10-06 update stage A 3
  • Related