I have a data set with more than 2 million rows in a Oracle SQL table and want to do some association analysis on this. To apply a sequence mining algorithm to this data, I need a column named 'sequenceId' and a column 'eventId'.
The table structure looks like this:
- timestamp
- product_id
- user_id
Now I need an id that increments every time the user_id changes. How can I do that in Oracle SQL? I tried it in R but it takes more than 12 hours there...
Sample data:
timestamp | product_id | user_id |
---|---|---|
2019-10-01 12:12:24 UTC | 3806 | 535447446 |
2019-10-01 19:51:55 UTC | 3762 | 535447446 |
2019-10-02 18:09:34 UTC | 3806 | 552286734 |
2019-10-02 17:54:01 UTC | 3928 | 493964166 |
Expected result:
timestamp | product_id | user_id | sequence id |
---|---|---|---|
2019-10-01 12:12:24 UTC | 3806 | 535447446 | 1 |
2019-10-01 19:51:55 UTC | 3762 | 535447446 | 1 |
2019-10-02 18:09:34 UTC | 3806 | 552286734 | 2 |
2019-10-02 17:54:01 UTC | 3928 | 493964166 | 3 |
The id should increment when the user_id changes
CodePudding user response:
WITH
source_data AS (
SELECT '2019-10-01 12:12:24' AS timestamp, 3806 AS product_id, 535447446 AS user_id FROM DUAL UNION ALL
SELECT '2019-10-01 19:51:55', 3762, 535447446 FROM DUAL UNION ALL
SELECT '2019-10-02 18:09:34', 3806, 552286734 FROM DUAL UNION ALL
SELECT '2019-10-02 17:54:01', 3928, 493964166 FROM DUAL
),
cte AS (
SELECT timestamp,
product_id,
user_id,
CASE WHEN user_id = LAG(user_id) OVER (ORDER BY timestamp)
THEN 0
ELSE 1
END new_user
FROM source_data
)
SELECT timestamp,
product_id,
user_id,
SUM(new_user) OVER (ORDER BY timestamp) sequence_id
FROM cte;
TIMESTAMP PRODUCT_ID USER_ID SEQUENCE_ID 2019-10-01 12:12:24 3806 535447446 1 2019-10-01 19:51:55 3762 535447446 1 2019-10-02 17:54:01 3928 493964166 2 2019-10-02 18:09:34 3806 552286734 3