Home > Mobile >  Using a non-null conditional LAG() function in MySQL
Using a non-null conditional LAG() function in MySQL

Time:12-23

I tried applying a couple solutions from here, but my question seems to be somewhat different from the OP's from this post.


I have a large dataset data in MySQL:

id          date          val
aaaaa       2021-01-01    TRUE
aaaaa       2021-01-02    FALSE
aaaaa       2021-01-03    FALSE
aaaaa       2021-01-04    TRUE
aaaaa       2021-01-05    FALSE
aaaaa       2021-01-06    TRUE
aaaaa       2021-01-07    FALSE
...
aaaaa       2021-12-31    FALSE
aaaab       2021-01-01    TRUE
aaaab       2021-01-02    FALSE
...
zzzzz       2021-12-31    FALSE

Here, id is a string-type data, date ranges from 2021-01-01 to 2021-12-31 without any missing days, and val contains a boolean value, TRUE or FALSE. data is ordered by id, date.

I would like to add two columns, lagged_date and date_diff.

  • lagged_date contains the previous date of the id where val = TRUE.
  • date_diff calculates the difference of the number of days between date and lagged_date in that row.

Ideally, my final dataset should look like this:

id          date          val        lagged_date     date_diff
aaaaa       2021-01-01    TRUE       NULL            NULL
aaaaa       2021-01-02    FALSE      2021-01-01      1
aaaaa       2021-01-03    FALSE      2021-01-01      2
aaaaa       2021-01-04    TRUE       2021-01-01      3
aaaaa       2021-01-05    FALSE      2021-01-04      1
aaaaa       2021-01-06    TRUE       2021-01-04      2
aaaaa       2021-01-07    FALSE      2021-01-06      1
...
aaaaa       2021-12-31    FALSE      2021-12-25      6
aaaab       2021-01-01    TRUE       NULL            NULL
aaaab       2021-01-02    FALSE      2021-01-01      1
...

(Note that this data is also ordered by id, date)

I tried a following query:

SELECT *,
       MAX(val) OVER (
          PARTITION BY id, val
          ORDER BY date
          ROWS BETWEEN UNBOUNDED PRECEDING AND 1 PRECEDING
       ) AS lagged_date,
       DATE_DIFF(date, lagged_date, DAY) AS date_diff
  FROM data

but the lagged_date does not produce my desired output, only produces the lagged val. I tried MAX(date) also, but to no avail.

Any insight is appreciated.

CodePudding user response:

Use subquery to find the previous date for each 'id' where is 'TRUE' then use DATE_DIFF function.

SELECT t1.id, t1.date, t1.val,
  (SELECT t2.date FROM data t2 WHERE t2.id = t1.id AND t2.val = TRUE AND t2.date < t1.date ORDER BY t2.date DESC LIMIT 1) as lagged_date,
  DATEDIFF(t1.date, (SELECT t2.date FROM data t2 WHERE t2.id = t1.id AND t2.val = TRUE AND t2.date < t1.date ORDER BY t2.date DESC LIMIT 1)) as date_diff
FROM data t1
  • Related