I tried applying a couple solutions from here, but my question seems to be somewhat different from the OP's from this post.
I have a large dataset data
in MySQL:
id date val
aaaaa 2021-01-01 TRUE
aaaaa 2021-01-02 FALSE
aaaaa 2021-01-03 FALSE
aaaaa 2021-01-04 TRUE
aaaaa 2021-01-05 FALSE
aaaaa 2021-01-06 TRUE
aaaaa 2021-01-07 FALSE
...
aaaaa 2021-12-31 FALSE
aaaab 2021-01-01 TRUE
aaaab 2021-01-02 FALSE
...
zzzzz 2021-12-31 FALSE
Here, id
is a string-type data, date
ranges from 2021-01-01
to 2021-12-31
without any missing days, and val
contains a boolean value, TRUE
or FALSE
. data
is ordered by id, date
.
I would like to add two columns, lagged_date
and date_diff
.
lagged_date
contains the previous date of theid
whereval = TRUE
.date_diff
calculates the difference of the number of days betweendate
andlagged_date
in that row.
Ideally, my final dataset should look like this:
id date val lagged_date date_diff
aaaaa 2021-01-01 TRUE NULL NULL
aaaaa 2021-01-02 FALSE 2021-01-01 1
aaaaa 2021-01-03 FALSE 2021-01-01 2
aaaaa 2021-01-04 TRUE 2021-01-01 3
aaaaa 2021-01-05 FALSE 2021-01-04 1
aaaaa 2021-01-06 TRUE 2021-01-04 2
aaaaa 2021-01-07 FALSE 2021-01-06 1
...
aaaaa 2021-12-31 FALSE 2021-12-25 6
aaaab 2021-01-01 TRUE NULL NULL
aaaab 2021-01-02 FALSE 2021-01-01 1
...
(Note that this data is also ordered by id, date
)
I tried a following query:
SELECT *,
MAX(val) OVER (
PARTITION BY id, val
ORDER BY date
ROWS BETWEEN UNBOUNDED PRECEDING AND 1 PRECEDING
) AS lagged_date,
DATE_DIFF(date, lagged_date, DAY) AS date_diff
FROM data
but the lagged_date
does not produce my desired output, only produces the lagged val
. I tried MAX(date)
also, but to no avail.
Any insight is appreciated.
CodePudding user response:
Use subquery to find the previous date for each 'id' where is 'TRUE' then use DATE_DIFF function.
SELECT t1.id, t1.date, t1.val,
(SELECT t2.date FROM data t2 WHERE t2.id = t1.id AND t2.val = TRUE AND t2.date < t1.date ORDER BY t2.date DESC LIMIT 1) as lagged_date,
DATEDIFF(t1.date, (SELECT t2.date FROM data t2 WHERE t2.id = t1.id AND t2.val = TRUE AND t2.date < t1.date ORDER BY t2.date DESC LIMIT 1)) as date_diff
FROM data t1