I have data of the following structure:
library(data.table)
dt <- data.table(ID = "ABC",
TIME_FROM = structure(c(17897L, 18262L, 18353L, 18628L,
18659L, 18779L, 18993L, 19358L),
class = c("IDate", "Date")),
TIME_TO = structure(c(18261L, 18352L, 18627L, 18658L,
18778L, 18992L, 19357L, 19722L),
class = c("IDate", "Date")),
VALID = c(T, T, T, T, F, T, T, T))
dt
#> ID TIME_FROM TIME_TO VALID
#> 1: ABC 2019-01-01 2019-12-31 TRUE
#> 2: ABC 2020-01-01 2020-03-31 TRUE
#> 3: ABC 2020-04-01 2020-12-31 TRUE
#> 4: ABC 2021-01-01 2021-01-31 TRUE
#> 5: ABC 2021-02-01 2021-05-31 FALSE
#> 6: ABC 2021-06-01 2021-12-31 TRUE
#> 7: ABC 2022-01-01 2022-12-31 TRUE
#> 8: ABC 2023-01-01 2023-12-31 TRUE
I want to determine, by ID
, full valid yearly circles that start at September 1st (e.g. 2020-09-01) and end at August 31st (e.g. 2021-08-31).
Is there a non-convoluted approach, preferrable with a pure data.table solution?
Expected Outcome
dt_target <- data.table(ID = "ABC",
YEAR = 2019:2024,
FULL_VALID_YR = c(F, T, F, T, T, F))
dt_target
#> ID YEAR FULL_VALID_YR
#> 1: ABC 2019 FALSE
#> 2: ABC 2020 TRUE
#> 3: ABC 2021 FALSE
#> 4: ABC 2022 TRUE
#> 5: ABC 2023 TRUE
#> 6: ABC 2024 FALSE
- 2019 is FALSE because it lacks the last 4 months of 2018 (September to December)
- 2020 is TRUE because the full year, starting from September 1st, 2019 until August 31st, 2020 is valid
- 2021 is FALSE because the time period from 2021-02-01 to 2021-05-31 is not valid
- 2022 is TRUE because the full year, starting from September 1st, 2021 until August 31st, 2022 is valid
- 2023 is TRUE because the full year, starting from September 1st, 2022 until August 31st, 2023 is valid
- 2024 is FALSE because it lacks the first 8 months of 2024 (January to August)
Final Notes:
Whether or not to include the non-full years (2019 & 2024) is optional since I will not consider them any further anyway.
It is, of course, arbitrary whether to use the start or the end (as I did in the example above) of the period to determine the years name.
CodePudding user response:
Not a data.table
solution, but should be possible to integrate it in a data.table
framework. I use the ivs
package to solve the problem:
dtValid <- dt[dt$VALID]
ivs <- iv_groups(iv(as.Date(dtValid$TIME_FROM), as.Date(dtValid$TIME_TO) 1))
#<iv<date>[2]>
#[1] [2019-01-01, 2021-02-01) [2021-06-01, 2024-01-01)
iv_yearly <-
iv(seq(as.Date("2018-09-01"), by = "1 year", length.out = 6),
seq(as.Date("2019-08-31"), by = "1 year", length.out = 6))
#<iv<date>[6]>
#[1] [2018-09-01, 2019-08-31) [2019-09-01, 2020-08-31)
#[3] [2020-09-01, 2021-08-31) [2021-09-01, 2022-08-31)
#[5] [2022-09-01, 2023-08-31) [2023-09-01, 2024-08-31)
iv_overlaps(iv_yearly, ivs, type = "within")
#[1] FALSE TRUE FALSE TRUE TRUE FALSE