Below is the base table of data..
date | customer_id | score1 | score2 |
---|---|---|---|
01/01/22 | a | 1 | 1 |
02/01/22 | a | 1 | 1 |
01/01/22 | b | 2 | 2 |
02/01/22 | b | 4 | 1 |
01/01/22 | c | 1 | 1 |
02/01/22 | c | 1 | 4 |
01/01/22 | d | 5 | 1 |
02/01/22 | d | 10 | 1 |
This is the result that I want to achieve, where I only pull through the rows where there has been a change in either score1 or score2 from the previous date. In this case, b has gone from 2 to 4 on score1 and 2 to 1 on score2, c has gone from 1 to 4 on score 2, d has gone from 5 to 10 on score 1.
date | customer_id | score1 | score2 |
---|---|---|---|
02/01/22 | b | 4 | 1 |
02/01/22 | c | 1 | 4 |
02/01/22 | d | 10 | 1 |
Unsure if there is a function to do this. Altenatively, would it be best to have two separate tables initially and use a join to achieve this. Using SQL presto if that helps.
Many thanks!
CodePudding user response:
We can use the LAG()
window function here:
WITH cte AS (
SELECT t.*, LAG(score, 1, score1) OVER (PARTITION BY customer_id
ORDER BY date) AS lag_score_1,
LAG(score, 1, score2) OVER (PARTITION BY customer_id
ORDER BY date) AS lag_score_2
FROM yourTable t
)
SELECT date, customer_id, score
FROM cte
WHERE score1 <> lag_score_1 OR score2 <> lag_score_2
ORDER BY date;
This answer uses the 3 parameter version of LAG()
. The second parameter specifies a step of 1, while the third parameter specifies a default value in case there is no previous value. In this case, we use score
as the default such that the earliest record in each customer partition is ignored.