Aggregating Time Series Data by Arbitrary Time Periods-CodePudding

I have data of the following structure:

library(data.table)
dt <- data.table(ID = "ABC",
           TIME_FROM = structure(c(17897L, 18262L, 18353L, 18628L,
                                   18659L, 18779L, 18993L, 19358L),
                                 class = c("IDate", "Date")),
           TIME_TO = structure(c(18261L, 18352L, 18627L, 18658L, 
                                 18778L, 18992L, 19357L, 19722L),
                               class = c("IDate", "Date")),
           VALID = c(T, T, T, T, F, T, T, T))
dt
#>     ID  TIME_FROM    TIME_TO VALID
#> 1: ABC 2019-01-01 2019-12-31  TRUE
#> 2: ABC 2020-01-01 2020-03-31  TRUE
#> 3: ABC 2020-04-01 2020-12-31  TRUE
#> 4: ABC 2021-01-01 2021-01-31  TRUE
#> 5: ABC 2021-02-01 2021-05-31 FALSE
#> 6: ABC 2021-06-01 2021-12-31  TRUE
#> 7: ABC 2022-01-01 2022-12-31  TRUE
#> 8: ABC 2023-01-01 2023-12-31  TRUE

I want to determine, by ID, full valid yearly circles that start at September 1st (e.g. 2020-09-01) and end at August 31st (e.g. 2021-08-31). Is there a non-convoluted approach, preferrable with a pure data.table solution?

Expected Outcome

dt_target <- data.table(ID = "ABC",
                        YEAR = 2019:2024,
                        FULL_VALID_YR = c(F, T, F, T, T, F))
dt_target
#>     ID YEAR FULL_VALID_YR
#> 1: ABC 2019         FALSE
#> 2: ABC 2020          TRUE
#> 3: ABC 2021         FALSE
#> 4: ABC 2022          TRUE
#> 5: ABC 2023          TRUE
#> 6: ABC 2024         FALSE

2019 is FALSE because it lacks the last 4 months of 2018 (September to December)
2020 is TRUE because the full year, starting from September 1st, 2019 until August 31st, 2020 is valid
2021 is FALSE because the time period from 2021-02-01 to 2021-05-31 is not valid
2022 is TRUE because the full year, starting from September 1st, 2021 until August 31st, 2022 is valid
2023 is TRUE because the full year, starting from September 1st, 2022 until August 31st, 2023 is valid
2024 is FALSE because it lacks the first 8 months of 2024 (January to August)

Final Notes:

Whether or not to include the non-full years (2019 & 2024) is optional since I will not consider them any further anyway.

It is, of course, arbitrary whether to use the start or the end (as I did in the example above) of the period to determine the years name.

CodePudding user response：

Not a data.table solution, but should be possible to integrate it in a data.table framework. I use the ivs package to solve the problem:

dtValid <- dt[dt$VALID]
ivs <- iv_groups(iv(as.Date(dtValid$TIME_FROM), as.Date(dtValid$TIME_TO)   1))
#<iv<date>[2]>
#[1] [2019-01-01, 2021-02-01) [2021-06-01, 2024-01-01)

iv_yearly <-
  iv(seq(as.Date("2018-09-01"), by = "1 year", length.out = 6),
   seq(as.Date("2019-08-31"), by = "1 year", length.out = 6))
#<iv<date>[6]>
#[1] [2018-09-01, 2019-08-31) [2019-09-01, 2020-08-31)
#[3] [2020-09-01, 2021-08-31) [2021-09-01, 2022-08-31)
#[5] [2022-09-01, 2023-08-31) [2023-09-01, 2024-08-31)

iv_overlaps(iv_yearly, ivs, type = "within")
#[1] FALSE  TRUE FALSE  TRUE  TRUE FALSE