Home > Net >  Aggregating Time Series Data by Arbitrary Time Periods
Aggregating Time Series Data by Arbitrary Time Periods

Time:12-10

I have data of the following structure:

library(data.table)
dt <- data.table(ID = "ABC",
           TIME_FROM = structure(c(17897L, 18262L, 18353L, 18628L,
                                   18659L, 18779L, 18993L, 19358L),
                                 class = c("IDate", "Date")),
           TIME_TO = structure(c(18261L, 18352L, 18627L, 18658L, 
                                 18778L, 18992L, 19357L, 19722L),
                               class = c("IDate", "Date")),
           VALID = c(T, T, T, T, F, T, T, T))
dt
#>     ID  TIME_FROM    TIME_TO VALID
#> 1: ABC 2019-01-01 2019-12-31  TRUE
#> 2: ABC 2020-01-01 2020-03-31  TRUE
#> 3: ABC 2020-04-01 2020-12-31  TRUE
#> 4: ABC 2021-01-01 2021-01-31  TRUE
#> 5: ABC 2021-02-01 2021-05-31 FALSE
#> 6: ABC 2021-06-01 2021-12-31  TRUE
#> 7: ABC 2022-01-01 2022-12-31  TRUE
#> 8: ABC 2023-01-01 2023-12-31  TRUE

I want to determine, by ID, full valid yearly circles that start at September 1st (e.g. 2020-09-01) and end at August 31st (e.g. 2021-08-31). Is there a non-convoluted approach, preferrable with a pure data.table solution?

Expected Outcome

dt_target <- data.table(ID = "ABC",
                        YEAR = 2019:2024,
                        FULL_VALID_YR = c(F, T, F, T, T, F))
dt_target
#>     ID YEAR FULL_VALID_YR
#> 1: ABC 2019         FALSE
#> 2: ABC 2020          TRUE
#> 3: ABC 2021         FALSE
#> 4: ABC 2022          TRUE
#> 5: ABC 2023          TRUE
#> 6: ABC 2024         FALSE
  • 2019 is FALSE because it lacks the last 4 months of 2018 (September to December)
  • 2020 is TRUE because the full year, starting from September 1st, 2019 until August 31st, 2020 is valid
  • 2021 is FALSE because the time period from 2021-02-01 to 2021-05-31 is not valid
  • 2022 is TRUE because the full year, starting from September 1st, 2021 until August 31st, 2022 is valid
  • 2023 is TRUE because the full year, starting from September 1st, 2022 until August 31st, 2023 is valid
  • 2024 is FALSE because it lacks the first 8 months of 2024 (January to August)

Final Notes:

Whether or not to include the non-full years (2019 & 2024) is optional since I will not consider them any further anyway.

It is, of course, arbitrary whether to use the start or the end (as I did in the example above) of the period to determine the years name.

CodePudding user response:

Not a data.table solution, but should be possible to integrate it in a data.table framework. I use the ivs package to solve the problem:

dtValid <- dt[dt$VALID]
ivs <- iv_groups(iv(as.Date(dtValid$TIME_FROM), as.Date(dtValid$TIME_TO)   1))
#<iv<date>[2]>
#[1] [2019-01-01, 2021-02-01) [2021-06-01, 2024-01-01)

iv_yearly <-
  iv(seq(as.Date("2018-09-01"), by = "1 year", length.out = 6),
   seq(as.Date("2019-08-31"), by = "1 year", length.out = 6))
#<iv<date>[6]>
#[1] [2018-09-01, 2019-08-31) [2019-09-01, 2020-08-31)
#[3] [2020-09-01, 2021-08-31) [2021-09-01, 2022-08-31)
#[5] [2022-09-01, 2023-08-31) [2023-09-01, 2024-08-31)

iv_overlaps(iv_yearly, ivs, type = "within")
#[1] FALSE  TRUE FALSE  TRUE  TRUE FALSE
  • Related