I have a time-series data frame in which each dive's start time is repeated on every row of that dive (it is the first recorded time for the dive). I need to sequentially add 10 seconds within each ptt (ID) and dive, so that the dive starts at the repeated time and each following row is offset by a further 10 seconds, ending at the start time plus 10 seconds times the number of rows until the next dive begins.
Current dataframe: df1
date_time ptt dive
1: 2016-03-10 13:35:00 153666 201
2: 2016-03-10 13:35:00 153666 201
3: 2016-03-10 13:35:00 153666 201
4: 2016-03-10 13:35:00 153666 201
5: 2016-03-10 13:35:00 153666 201
6: 2016-03-10 13:35:00 153666 201
7: 2016-03-10 13:35:00 153666 201
8: 2016-03-10 13:35:00 153666 201
9: 2016-03-10 13:35:00 153666 201
10: 2016-03-10 13:35:00 153666 201
dput:
structure(list(date_time = structure(c(1457616900, 1457616900,
1457616900, 1457616900, 1457616900, 1457616900, 1457616900, 1457616900,
1457616900, 1457616900, 1457616900, 1457616900, 1457616900, 1457616900,
1457616900, 1457616900, 1457616900, 1457616900, 1457616900, 1457616900,
1457616900), class = c("POSIXct", "POSIXt"), tzone = "UTC"),
ptt = c(153666L, 153666L, 153666L, 153666L, 153666L, 153666L,
153666L, 153666L, 153666L, 153666L, 153666L, 153666L, 153666L,
153666L, 153666L, 153666L, 153666L, 153666L, 153666L, 153666L,
153666L), dive = c(201L, 201L, 201L, 201L, 201L, 201L, 201L,
201L, 201L, 201L, 201L, 201L, 201L, 201L, 201L, 201L, 201L,
201L, 201L, 201L, 201L)), row.names = c(NA, -21L), class = c("data.table",
"data.frame")
I need to sequentially add 10 seconds to each date_time by ptt and dive.
I have the following loop:
# seconds() comes from lubridate
for (i in 2:nrow(df)) {
  if (df$dive[i] == df$dive[i-1])
    df$date_time[i] <- df$date_time[i-1] + seconds(10)
}
This gives the correct result, but with over 1 million records it has been running for 3 days and still hasn't finished. Does anyone know how to do this quickly using dplyr and/or lubridate?
The expected outcome should look like this: dfnew
date_time ptt dive
1: 2016-03-10 13:35:00 153666 201
2: 2016-03-10 13:35:10 153666 201
3: 2016-03-10 13:35:20 153666 201
4: 2016-03-10 13:35:30 153666 201
5: 2016-03-10 13:35:40 153666 201
6: 2016-03-10 13:35:50 153666 201
7: 2016-03-10 13:36:00 153666 201
8: 2016-03-10 13:36:10 153666 201
9: 2016-03-10 13:36:20 153666 201
10: 2016-03-10 13:36:30 153666 201 ...
CodePudding user response:
Expanding on the already proposed solution, which is great syntax-wise but may or may not fix your performance issue: what will help is changing how these loops, explicit or implicit, are executed in the backend. dtplyr is a good example of this (it relies on a data.table backend); Rcpp would be another way.
library(dplyr)
library(lubridate)   # for ymd_hms() and seconds()

# first let's add another ID (a second ptt/dive group)
for (i in 1:3) {
  df <- rbind(df, list(ymd_hms("2016-03-11 13:35:10"), 123, 456))
}
library(dtplyr)
res_dplyr <- df %>% group_by(dive, ptt) %>% mutate(new_date_time = first(date_time) + seconds(10 * (row_number() - 1))) %>% as_tibble()
res_dt <- df %>% lazy_dt() %>% group_by(dive, ptt) %>% mutate(new_date_time = first(date_time) + seconds(10 * (row_number() - 1))) %>% as_tibble()
identical(res_dt, res_dplyr)
[1] TRUE
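For completeness, the same transformation can also be expressed directly in data.table syntax, which is roughly what lazy_dt() generates behind the scenes. A minimal sketch, assuming df is converted in place with setDT():
library(data.table)
setDT(df)
# first date_time per ptt/dive group plus 0, 10, 20, ... seconds;
# POSIXct arithmetic is in seconds, so a plain numeric offset works
df[, new_date_time := date_time[1] + 10 * (seq_len(.N) - 1), by = .(ptt, dive)]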
Benchmarking
library(microbenchmark)
microbenchmark(
  res_dplyr = df %>% group_by(dive, ptt) %>% mutate(new_date_time = first(date_time) + seconds(10 * (row_number() - 1))) %>% as_tibble(),
  res_dt = df %>% lazy_dt() %>% group_by(dive, ptt) %>% mutate(new_date_time = first(date_time) + seconds(10 * (row_number() - 1))) %>% as_tibble()
)
Unit: milliseconds
expr min lq mean median uq max neval
res_dplyr 2.580601 2.803251 3.093104 2.879601 3.071451 7.461002 100
res_dt 2.599101 2.739351 3.001568 2.822751 2.901702 7.537400 100
Here, the improvement is slight, but on larger data sets it might be a different story.
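To see whether it becomes a different story at the scale mentioned in the question, one option is to inflate the sample before rerunning the same microbenchmark() call. A rough sketch; big_df, the replication factor, and the relabelled dive values are purely illustrative:
# repeat the sample rows to roughly a million rows and relabel dives so
# there are many ptt/dive groups, then rerun the benchmark above on big_df
big_df <- df[rep(seq_len(nrow(df)), 50000), ]
big_df$dive <- rep(seq_len(50000), each = nrow(df))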
CodePudding user response:
As Limey said in the comments, this worked perfectly. A very simple and easy solution.
df %>%
  group_by(dive, ptt) %>%
  mutate(new_date_time = first(date_time) + seconds(10 * (row_number() - 1)))
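A small aside: because POSIXct arithmetic already works in seconds, the lubridate seconds() wrapper is optional here. A minimal equivalent sketch, assuming dplyr is attached:
df %>%
  group_by(dive, ptt) %>%
  mutate(new_date_time = first(date_time) + 10 * (row_number() - 1)) %>%
  ungroup()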