Sample table:
object_id | event_time | event_type | event_subtype | stage |
---|---|---|---|---|
1 | 2022-10-01 | create | name, stage | A |
1 | 2022-10-02 | update | stage | B |
1 | 2022-10-03 | update | stage | C |
1 | 2022-10-04 | update | stage | A |
2 | 2022-10-01 | create | name, stage | A |
2 | 2022-10-02 | update | stage | C |
2 | 2022-10-03 | update | stage | A |
2 | 2022-10-04 | update | stage | B |
2 | 2022-10-05 | update | stage | C |
2 | 2022-10-06 | update | stage | A |
So what I need is a column that numbers the rows based on the stage - after an object_id reaches stage C, the row number of the same object_id should be incremented. It'd look like this:
object_id | event_time | event_type | event_subtype | stage | row_number |
---|---|---|---|---|---|
1 | 2022-10-01 | create | name, stage | A | 1 |
1 | 2022-10-02 | update | stage | B | 1 |
1 | 2022-10-03 | update | stage | C | 1 |
1 | 2022-10-04 | update | stage | A | 2 |
2 | 2022-10-01 | create | name, stage | A | 1 |
2 | 2022-10-02 | update | stage | C | 1 |
2 | 2022-10-03 | update | stage | A | 2 |
2 | 2022-10-04 | update | stage | B | 2 |
2 | 2022-10-05 | update | stage | C | 2 |
2 | 2022-10-06 | update | stage | A | 3 |
The table must be ordered by object_id, event_time. I'm having troublle writing the window function that does this, this is what I've tried:
row_number() over (partition by object_id, stage order by event_time)
It just doesn't work for all cases. Also I'm having a hard time understanding how this would work when I'm not defining stage = C
as the delimiter anywhere. Any ideas?
Thanks!
CodePudding user response:
I would suggest using rolling sum based on previous value of stage
:
-- sample data
with dataset(object_id, event_time, event_type, event_subtype, stage) as (
values (1, '2022-10-01', 'create', 'name, stage', 'A'),
(1, '2022-10-02', 'update', 'stage', 'B'),
(1, '2022-10-03', 'update', 'stage', 'C'),
(1, '2022-10-04', 'update', 'stage', 'A'),
(2, '2022-10-01', 'create', 'name, stage',' A'),
(2, '2022-10-02', 'update', 'stage', 'C'),
(2, '2022-10-03', 'update', 'stage', 'A'),
(2, '2022-10-04', 'update', 'stage', 'B'),
(2, '2022-10-05', 'update', 'stage', 'C'),
(2, '2022-10-06', 'update', 'stage', 'A')
)
-- query
select object_id,
event_time,
event_type,
event_subtype,
stage,
1 sum(counter) over (partition by object_id order by event_time) as num
from (select *,
if(lag(stage) over (partition by object_id order by event_time) = 'C', 1, 0) counter
from dataset);
Output:
object_id | event_time | event_type | event_subtype | stage | num |
---|---|---|---|---|---|
1 | 2022-10-01 | create | name, stage | A | 1 |
1 | 2022-10-02 | update | stage | B | 1 |
1 | 2022-10-03 | update | stage | C | 1 |
1 | 2022-10-04 | update | stage | A | 2 |
2 | 2022-10-01 | create | name, stage | A | 1 |
2 | 2022-10-02 | update | stage | C | 1 |
2 | 2022-10-03 | update | stage | A | 2 |
2 | 2022-10-04 | update | stage | B | 2 |
2 | 2022-10-05 | update | stage | C | 2 |
2 | 2022-10-06 | update | stage | A | 3 |
CodePudding user response:
It is not the problem of the row_number to give you the final "correct" order.
for you mist use the ORDER BY
SELECT
"object_id", "event_time", "event_type", "event_subtype", "stage",
ROW_NUMBER() OVER(PARTITION BY "object_id","stage" ORDER BY "event_time") rn
FROM tab1
ORDER BY "object_id",rn,"stage"
object_id | event_time | event_type | event_subtype | stage | rn |
---|---|---|---|---|---|
1 | 2022-10-01 | create | name, stage | A | 1 |
1 | 2022-10-02 | update | stage | B | 1 |
1 | 2022-10-03 | update | stage | C | 1 |
1 | 2022-10-04 | update | stage | A | 2 |
2 | 2022-10-01 | create | name, stage | A | 1 |
2 | 2022-10-04 | update | stage | B | 1 |
2 | 2022-10-02 | update | stage | C | 1 |
2 | 2022-10-03 | update | stage | A | 2 |
2 | 2022-10-05 | update | stage | C | 2 |
2 | 2022-10-06 | update | stage | A | 3 |