Completing and filling several timeseries with all seconds between start and end within one data.tab-CodePudding

I have a data.table which looks like this:

    sample <- structure(list(TimeStamp = structure(c(1629036673.5, 1629036676.35, 
              1629036760.29, 1629036760.95, 1629036769.3, 1629037223.1, 1629037237.81, 
              1629037241.6, 1629037243.34), class = c("POSIXct", "POSIXt"), tzone = "UTC"), 
              Hen = c("CXU", "CXU", "CXU", "CXU", "CXU", "CCE", "CCE", 
              "CCE", "CCE"), Units = c("M2", "M2", "HM2", "HM2", "H2", 
              "HM2", "HM2", "M2", "M2"), dummy = c(1L, 1L, 2L, 2L, 3L, 
              103L, 103L, 104L, 104L), Timing = c("Start", "End", "Start", 
              "End", "Start", "Start", "End", "Start", "End")), row.names = c(NA, 
               -9L), class = c("data.table", "data.frame"))
    > sample
             TimeStamp Hen Units dummy Timing
1: 2021-08-15 14:11:13 CXU    M2     1  Start
2: 2021-08-15 14:11:16 CXU    M2     1    End
3: 2021-08-15 14:12:40 CXU   HM2     2  Start
4: 2021-08-15 14:12:40 CXU   HM2     2    End
5: 2021-08-15 14:12:49 CXU    H2     3  Start
6: 2021-08-15 14:20:23 CCE   HM2   103  Start
7: 2021-08-15 14:20:37 CCE   HM2   103    End
8: 2021-08-15 14:20:41 CCE    M2   104  Start
9: 2021-08-15 14:20:43 CCE    M2   104    End

What I need is to complete every single time sequence by seconds indicated by Start and End (sometimes only a start exists, meaning the entry is both start and end) and to fill up some of the variables. I was able to achieve this using complete and fill (using packages tidyr and dplyr):

> sample %>% 
      group_by(dummy) %>% 
      complete(TimeStamp = seq(first(TimeStamp), max(TimeStamp), by = "sec")) %>% 
      fill(Hen, Units, dummy)
# A tibble: 26 x 5
# Groups:   dummy [5]
   dummy TimeStamp           Hen   Units Timing
   <int> <dttm>              <chr> <chr> <chr> 
 1     1 2021-08-15 14:11:13 CXU   M2    Start 
 2     1 2021-08-15 14:11:14 CXU   M2    NA    
 3     1 2021-08-15 14:11:15 CXU   M2    NA    
 4     1 2021-08-15 14:11:16 CXU   M2    End   
 5     2 2021-08-15 14:12:40 CXU   HM2   Start 
 6     2 2021-08-15 14:12:40 CXU   HM2   End   
 7     3 2021-08-15 14:12:49 CXU   H2    Start 
 8   103 2021-08-15 14:20:23 CCE   HM2   Start 
 9   103 2021-08-15 14:20:24 CCE   HM2   NA    
10   103 2021-08-15 14:20:25 CCE   HM2   NA    
# ... with 16 more rows

My problem is that I am using data.table for a reason, which is that I have a huge amount of data and the solution I found just runs much too slow. My idea was to use a more data.table base approach by first creating a separate time sequence and to merge the the two. Filling could then be done using na.locf from zoo. However I am failing to create the time sequence in any efficient manner. Does anyone have any ideas?

CodePudding user response：

Try this approach... Also, your are using decimals in your timestamps, I suggest rounding before creating a sequence.

library(data.table)
library(zoo)
# you need to round your decimal timestamps to seconds
#  use ceiling() or floor() if you prefer
sample[, TimeStamp := as.POSIXct(round(TimeStamp))]
# Create a data.table with min-max sequences
DT <- sample[, .(TimeStamp = seq(min(TimeStamp), max(TimeStamp), by = 1)), by = .(dummy)]
# Perform update join
DT[sample, 
   `:=`(Hen = i.Hen, Units = i.Units, Timing = i.Timing),
   on = .(dummy, TimeStamp)]
# Fill down NA's Hen and Units columns
#  data.table's setnafill doens not (yet?) support character columns
#  so we use zoo:na.locf()
DT[, c("Hen", "Units") := lapply(.SD, zoo::na.locf), 
   .SDcols = c("Hen", "Units")]

#    dummy           TimeStamp Hen Units Timing
# 1:     1 2021-08-15 14:11:14 CXU    M2  Start
# 2:     1 2021-08-15 14:11:15 CXU    M2   <NA>
# 3:     1 2021-08-15 14:11:16 CXU    M2    End
# 4:     2 2021-08-15 14:12:40 CXU   HM2  Start
# 5:     2 2021-08-15 14:12:41 CXU   HM2    End
# 6:     3 2021-08-15 14:12:49 CXU    H2  Start
# 7:   103 2021-08-15 14:20:23 CCE   HM2  Start
# 8:   103 2021-08-15 14:20:24 CCE   HM2   <NA>
# 9:   103 2021-08-15 14:20:25 CCE   HM2   <NA>
#10:   103 2021-08-15 14:20:26 CCE   HM2   <NA>
#11:   103 2021-08-15 14:20:27 CCE   HM2   <NA>
#12:   103 2021-08-15 14:20:28 CCE   HM2   <NA>
#13:   103 2021-08-15 14:20:29 CCE   HM2   <NA>
#14:   103 2021-08-15 14:20:30 CCE   HM2   <NA>
#15:   103 2021-08-15 14:20:31 CCE   HM2   <NA>
#16:   103 2021-08-15 14:20:32 CCE   HM2   <NA>
#17:   103 2021-08-15 14:20:33 CCE   HM2   <NA>
#18:   103 2021-08-15 14:20:34 CCE   HM2   <NA>
#19:   103 2021-08-15 14:20:35 CCE   HM2   <NA>
#20:   103 2021-08-15 14:20:36 CCE   HM2   <NA>
#21:   103 2021-08-15 14:20:37 CCE   HM2   <NA>
#22:   103 2021-08-15 14:20:38 CCE   HM2    End
#23:   104 2021-08-15 14:20:42 CCE    M2  Start
#24:   104 2021-08-15 14:20:43 CCE    M2    End
#    dummy           TimeStamp Hen Units Timing