I have a time-series data frame in which each dive's start time is repeated on every row of that dive (it is the first recorded time for the dive). I need to sequentially add 10 seconds within each ptt (ID) and dive, so that the dive starts at the repeated time and each following row is offset by a further 10 seconds, ending at the start time plus 10 seconds times the number of rows until the next dive begins.
Current dataframe: df1
date_time ptt dive
1: 2016-03-10 13:35:00 153666 201
2: 2016-03-10 13:35:00 153666 201
3: 2016-03-10 13:35:00 153666 201
4: 2016-03-10 13:35:00 153666 201
5: 2016-03-10 13:35:00 153666 201
6: 2016-03-10 13:35:00 153666 201
7: 2016-03-10 13:35:00 153666 201
8: 2016-03-10 13:35:00 153666 201
9: 2016-03-10 13:35:00 153666 201
10: 2016-03-10 13:35:00 153666 201
dput:
structure(list(date_time = structure(c(1457616900, 1457616900,
1457616900, 1457616900, 1457616900, 1457616900, 1457616900, 1457616900,
1457616900, 1457616900, 1457616900, 1457616900, 1457616900, 1457616900,
1457616900, 1457616900, 1457616900, 1457616900, 1457616900, 1457616900,
1457616900), class = c("POSIXct", "POSIXt"), tzone = "UTC"),
ptt = c(153666L, 153666L, 153666L, 153666L, 153666L, 153666L,
153666L, 153666L, 153666L, 153666L, 153666L, 153666L, 153666L,
153666L, 153666L, 153666L, 153666L, 153666L, 153666L, 153666L,
153666L), dive = c(201L, 201L, 201L, 201L, 201L, 201L, 201L,
201L, 201L, 201L, 201L, 201L, 201L, 201L, 201L, 201L, 201L,
201L, 201L, 201L, 201L)), row.names = c(NA, -21L), class = c("data.table",
"data.frame")
I need to sequentially add 10 seconds to each date_time by ptt and dive.
I have the following loop:
# seconds() comes from lubridate
for (i in 2:nrow(df)) {
  if (df$dive[i] == df$dive[i-1])
    df$date_time[i] <- df$date_time[i-1] + seconds(10)
}
This gives the correct result, but with over 1 million records it has been running for 3 days and still hasn't finished. Does anyone know how to do this quickly using dplyr and/or lubridate?
The expected outcome should look like this: dfnew
date_time ptt dive
1: 2016-03-10 13:35:00 153666 201
2: 2016-03-10 13:35:10 153666 201
3: 2016-03-10 13:35:20 153666 201
4: 2016-03-10 13:35:30 153666 201
5: 2016-03-10 13:35:40 153666 201
6: 2016-03-10 13:35:50 153666 201
7: 2016-03-10 13:36:00 153666 201
8: 2016-03-10 13:36:10 153666 201
9: 2016-03-10 13:36:20 153666 201
10: 2016-03-10 13:36:30 153666 201 ...
CodePudding user response:
Expanding on the already proposed solution, which is great syntax-wise but may or may not fix your performance issue: what will help is changing how these loops, explicit or implicit, are executed in the backend. dtplyr is a good example of this (it relies on a data.table backend); Rcpp would be another way.
library(dplyr)
library(lubridate)   # for ymd_hms() and seconds()

# first let's add another ID (a second ptt/dive group)
for (i in 1:3) {
  df <- rbind(df, list(ymd_hms("2016-03-11 13:35:10"), 123, 456))
}
library(dtplyr)
res_dplyr <- df %>% group_by(dive, ptt) %>% mutate(new_date_time = first(date_time) + seconds(10 * (row_number() - 1))) %>% as_tibble()
res_dt <- df %>% lazy_dt() %>% group_by(dive, ptt) %>% mutate(new_date_time = first(date_time) + seconds(10 * (row_number() - 1))) %>% as_tibble()
identical(res_dt, res_dplyr)
[1] TRUE
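For completeness, the same transformation can also be expressed directly in data.table syntax, which is roughly what lazy_dt() generates behind the scenes. A minimal sketch, assuming df is converted in place with setDT():
library(data.table)
setDT(df)
# first date_time per ptt/dive group plus 0, 10, 20, ... seconds;
# POSIXct arithmetic is in seconds, so a plain numeric offset works
df[, new_date_time := date_time[1] + 10 * (seq_len(.N) - 1), by = .(ptt, dive)]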
Benchmarking
library(microbenchmark)
microbenchmark(
  res_dplyr = df %>% group_by(dive, ptt) %>% mutate(new_date_time = first(date_time) + seconds(10 * (row_number() - 1))) %>% as_tibble(),
  res_dt = df %>% lazy_dt() %>% group_by(dive, ptt) %>% mutate(new_date_time = first(date_time) + seconds(10 * (row_number() - 1))) %>% as_tibble()
)
Unit: milliseconds
expr min lq mean median uq max neval
res_dplyr 2.580601 2.803251 3.093104 2.879601 3.071451 7.461002 100
res_dt 2.599101 2.739351 3.001568 2.822751 2.901702 7.537400 100
Here, the improvement is slight, but on larger data sets it might be a different story.
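To see whether it becomes a different story at the scale mentioned in the question, one option is to inflate the sample before rerunning the same microbenchmark() call. A rough sketch; big_df, the replication factor, and the relabelled dive values are purely illustrative:
# repeat the sample rows to roughly a million rows and relabel dives so
# there are many ptt/dive groups, then rerun the benchmark above on big_df
big_df <- df[rep(seq_len(nrow(df)), 50000), ]
big_df$dive <- rep(seq_len(50000), each = nrow(df))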
CodePudding user response:
As Limey said in the comments, this worked perfectly. A very simple and easy solution.
df %>%
  group_by(dive, ptt) %>%
  mutate(new_date_time = first(date_time) + seconds(10 * (row_number() - 1)))
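A small aside: because POSIXct arithmetic already works in seconds, the lubridate seconds() wrapper is optional here. A minimal equivalent sketch, assuming dplyr is attached:
df %>%
  group_by(dive, ptt) %>%
  mutate(new_date_time = first(date_time) + 10 * (row_number() - 1)) %>%
  ungroup()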