Data tidying on date/time

I have a very large dataset with date and time in a single column, recorded at 15-minute intervals. Unfortunately, the software recording the data has some issues, so 15-minute intervals are randomly missing (usually 1 or 2, but sometimes 3 or 4). The dataset is reported as follows:

   Date_and_time         Pressure
   2016-07-08 18:00:00    3.542
   2016-07-08 18:15:00    5:444
   2016-07-08 18:45:00    2:556
   2016-07-08 19:00:00    4:567

I am looking for a way to insert a row for each missing time slot. My goal is to stack this data for multiple sites on top of each other, and for graphing purposes I need to make sure that they line up.

CodePudding user response：

If you can perfectly guarantee that all times are aligned on the quarter hour, then you could try this:

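library(dplyr)
# Build the full 15-minute sequence between the first and last timestamps, then join the data onto it
tibble(Date_and_time = seq(min(dat$Date_and_time), max(dat$Date_and_time), by = "15 mins")) %>%
  full_join(dat, by = "Date_and_time")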
# # A tibble: 5 x 2
#   Date_and_time       Pressure
#   <dttm>              <chr>   
# 1 2016-07-08 18:00:00 3.542   
# 2 2016-07-08 18:15:00 5:444   
# 3 2016-07-08 18:30:00 <NA>    
# 4 2016-07-08 18:45:00 2:556   
# 5 2016-07-08 19:00:00 4:567
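
Whether the times really are aligned on the quarter hour can be checked before relying on the join. The following is a minimal sketch in base R, not part of the original answer: it assumes the timestamps are `POSIXct` and that the time zone offset is a multiple of 15 minutes, and the names `times`, `offset_secs`, `all_aligned`, and `snapped` are illustrative only.

```r
# Sample timestamps from the question, parsed as UTC for reproducibility (assumption)
times <- as.POSIXct(
  c("2016-07-08 18:00:00", "2016-07-08 18:15:00",
    "2016-07-08 18:45:00", "2016-07-08 19:00:00"),
  tz = "UTC"
)

# Seconds past the most recent quarter hour (0 means perfectly aligned)
offset_secs <- as.numeric(times) %% (15 * 60)
all_aligned <- all(offset_secs == 0)

# Optional: snap slightly-off timestamps to the nearest quarter hour
snapped <- as.POSIXct(
  round(as.numeric(times) / (15 * 60)) * (15 * 60),
  origin = "1970-01-01", tz = "UTC"
)
```

If `all_aligned` is `FALSE`, snapping first (as in `snapped`) is one way to make the sequence-and-join approach usable, at the cost of altering the recorded timestamps.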

If you think there is a chance that your times are not perfectly aligned (even a fraction of a second will introduce unnecessary rows), then we can turn this into a problem of "enforce a gap of no more than 15 minutes":

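dat %>%
  # start a new group wherever the gap to the previous row exceeds 15 minutes
  group_by(grp = cumsum(c(FALSE, as.numeric(diff(Date_and_time), units = "mins") > 15))) %>%
  # one filler row per group, 15 minutes after the group's last timestamp
  summarize(Date_and_time = max(Date_and_time) + 15*60) %>%
  bind_rows(dat) %>%
  arrange(Date_and_time) %>%
  select(-grp)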
# # A tibble: 6 x 2
#   Date_and_time       Pressure
#   <dttm>              <chr>   
# 1 2016-07-08 18:00:00 3.542   
# 2 2016-07-08 18:15:00 5:444   
# 3 2016-07-08 18:30:00 <NA>    
# 4 2016-07-08 18:45:00 2:556   
# 5 2016-07-08 19:00:00 4:567   
# 6 2016-07-08 19:15:00 <NA>

Notice that the last added row is unnecessary; it can be removed in a simple clean-up step. The premise of this second method is that it creates groups in which consecutive rows are 15 minutes (or less) apart, and then adds one row 15 minutes after the last row of each group. This fills any gap of a single missing interval, but:

It will always produce a single row at the bottom that may not be needed;
It does not make any assurance of the gap between the added rows and the rows beneath them. For example, if your third row was instead at "2016-07-08 18:31:00", then the time would sequence through "18:15:00", "18:30:00", then "18:31:00" (with a 1-minute gap); and
It adds only one row per group, so a gap of two or more missing intervals (say "18:15:00" followed directly by "19:00:00") is only partly filled and the step would have to be repeated.
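
The clean-up of the trailing row could look like the following minimal sketch. It assumes that the filler rows to drop are exactly those later than the last real observation; `filled` and `cleaned` are illustrative names, and the sample data is rebuilt here with an explicit UTC time zone so the block runs on its own.

```r
library(dplyr)

# Sample data from the question (Pressure kept as character, as posted)
dat <- data.frame(
  Date_and_time = as.POSIXct(
    c("2016-07-08 18:00:00", "2016-07-08 18:15:00",
      "2016-07-08 18:45:00", "2016-07-08 19:00:00"),
    tz = "UTC"
  ),
  Pressure = c("3.542", "5:444", "2:556", "4:567")
)

# The gap-filling pipeline from the answer: one filler row per group
filled <- dat %>%
  group_by(grp = cumsum(c(FALSE, as.numeric(diff(Date_and_time), units = "mins") > 15))) %>%
  summarize(Date_and_time = max(Date_and_time) + 15 * 60) %>%
  bind_rows(dat) %>%
  arrange(Date_and_time) %>%
  select(-grp)

# Clean-up: drop any filler row that falls after the last real observation
cleaned <- filled %>%
  filter(Date_and_time <= max(dat$Date_and_time))
```

Filtering on the original maximum, rather than simply dropping the last row, keeps the step safe to run even if no trailing filler row happens to be present.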

Data

dat <- structure(list(Date_and_time = structure(c(1468015200, 1468016100, 1468017900, 1468018800), class = c("POSIXct", "POSIXt"), tzone = ""), Pressure = c("3.542", "5:444", "2:556", "4:567")), row.names = c(NA, -4L), class = "data.frame")

CodePudding user response：

You could make a sequence that has all potential sampling times and then join your data to that.

library(tidyverse)

# Every 15-minute slot in the window of interest
ALL_PERIODS <- data.frame(SAMPLE_TIME = seq(from = as.POSIXct("2016-07-08 18:00:00"),
                                            to = as.POSIXct("2016-07-08 20:00:00"), by = "15 min"))

SAMPLE_DATA <- data.frame(
  Date_and_time = as.POSIXct(c("2016-07-08 18:00:00", "2016-07-08 18:15:00",
                               "2016-07-08 18:45:00", "2016-07-08 19:00:00")),
  pressure = c(3.542, 5.444, 2.556, 4.567))

# Slots with no reading get NA for pressure
ALL_PERIODS_DATA <- left_join(ALL_PERIODS, SAMPLE_DATA, by = c("SAMPLE_TIME" = "Date_and_time"))
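
Since the goal in the question is to stack several sites so that they line up, the same sequence-and-join idea might be extended with a site column. This is only a sketch under assumptions: the `site` column, its values "A" and "B", and the site B readings are hypothetical, and the real data may be organised differently.

```r
library(dplyr)

# Hypothetical two-site data; the `site` column and site B's values are made up
SAMPLE_DATA <- data.frame(
  site = c("A", "A", "A", "B", "B"),
  Date_and_time = as.POSIXct(
    c("2016-07-08 18:00:00", "2016-07-08 18:15:00", "2016-07-08 18:45:00",
      "2016-07-08 18:00:00", "2016-07-08 18:30:00"),
    tz = "UTC"
  ),
  pressure = c(3.542, 5.444, 2.556, 4.567, 3.210)
)

# Every site paired with every 15-minute slot in the overall time range
ALL_PERIODS <- expand.grid(
  site = unique(SAMPLE_DATA$site),
  Date_and_time = seq(min(SAMPLE_DATA$Date_and_time),
                      max(SAMPLE_DATA$Date_and_time),
                      by = "15 min"),
  stringsAsFactors = FALSE
)

# Join the readings onto the grid; slots without a reading get NA
ALL_PERIODS_DATA <- ALL_PERIODS %>%
  left_join(SAMPLE_DATA, by = c("site", "Date_and_time")) %>%
  arrange(site, Date_and_time)
```

Each site ends up with the same set of timestamps, so the stacked rows line up for plotting, with `NA` marking the slots the logger skipped.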