let's assume the following dataframe:
df <- tibble(ID = c(12, 12, 12, 13, 13, 13),
times = c(as.POSIXct("2021-01-02 10:00:00"),
as.POSIXct("2021-01-02 11:00:00"),
as.POSIXct("2021-01-02 13:00:00"),
as.POSIXct("2021-01-02 13:00:00"),
as.POSIXct("2021-01-02 14:00:00"),
as.POSIXct("2021-01-02 15:00:00")))
ID times
<dbl> <dttm>
1 12 2021-01-02 10:00:00
2 12 2021-01-02 11:00:00
3 12 2021-01-02 13:00:00
4 13 2021-01-02 13:00:00
5 13 2021-01-02 14:00:00
6 13 2021-01-02 15:00:00
What I want is a column, that considers each timestamp of an ID as start value and computes the number of subsequent observation with the next 2h. So this is my goal:
ID times n_obs_within_2h
<dbl> <dttm> <dbl>
1 12 2021-01-02 10:00:00 2
2 12 2021-01-02 11:00:00 2
3 12 2021-01-02 13:00:00 1
4 13 2021-01-02 13:00:00 3
5 13 2021-01-02 14:00:00 2
6 13 2021-01-02 15:00:00 1
I know that this could be easily done with purrr::map
by iterating over each row. However, my original dataset it quite big which makes it fairly unefficient to do so. Can you think of another way than iterating over each row to achieve the computing n_obs_within_2h
EDIT: my current attempt:
df %>% group_by(ID) %>%
mutate(n_obs_with_2h = purrr::pmap_dbl(.l = list(ID, times),
.f = function(i, t, data) {
n <- data %>%
filter(ID == i) %>%
filter(between(as.double.difftime(times-t, units = "hours"),
0, 2)) %>%
nrow()
return(n)
}, data = .))
CodePudding user response:
Maybe a vectorised approach using a sliding window to count subsequent observations within the next 2 hours?
library(tidyverse)
library(lubridate)
library(slider)
df <- tibble(
ID = c(12, 12, 12, 13, 13, 13),
times = c(
as.POSIXct("2021-01-02 10:00:00"),
as.POSIXct("2021-01-02 11:00:00"),
as.POSIXct("2021-01-02 13:00:00"),
as.POSIXct("2021-01-02 13:00:00"),
as.POSIXct("2021-01-02 14:00:00"),
as.POSIXct("2021-01-02 15:00:00")
)
)
df |>
group_by(ID) |>
mutate(
diff = difftime(times, min(times), units = "hours"),
within_2 = if_else(diff <= 2, 1, 0),
n_obs_within_2h = slide_dbl(within_2, sum, .after = Inf)
) |>
ungroup()
#> # A tibble: 6 × 5
#> ID times diff within_2 n_obs_within_2h
#> <dbl> <dttm> <drtn> <dbl> <dbl>
#> 1 12 2021-01-02 10:00:00 0 hours 1 2
#> 2 12 2021-01-02 11:00:00 1 hours 1 1
#> 3 12 2021-01-02 13:00:00 3 hours 0 0
#> 4 13 2021-01-02 13:00:00 0 hours 1 3
#> 5 13 2021-01-02 14:00:00 1 hours 1 2
#> 6 13 2021-01-02 15:00:00 2 hours 1 1
Created on 2022-06-30 by the reprex package (v2.0.1)
CodePudding user response:
Using another approach within the mapping is probably more likely to be the key to better performance in this case. Instead of using filtering on the full data, we could utilize the grouped structure itself like this :
df |>
group_by(ID) %>%
mutate(n_obs_with_2h = purrr::map_dbl(times, ~ sum(difftime(times[ID == ID], ., units = "hours") <= 2 & difftime(times[ID == ID], ., units = "hours") >= 0))) %>%
ungroup()
# A tibble: 6 × 3
# ID times n_obs_with_2h
# <dbl> <dttm> <dbl>
# 12 2021-01-02 10:00:00 2
# 12 2021-01-02 11:00:00 2
# 12 2021-01-02 13:00:00 1
# 13 2021-01-02 13:00:00 3
# 13 2021-01-02 14:00:00 2
# 13 2021-01-02 15:00:00 1
See benchmark: (Even if the provided data is too small for this to be reliable. That being said I would expect it to be even faster on a bigger set)
fun_original <- function(df) {
df %>% group_by(ID) %>%
mutate(n_obs_with_2h = purrr::pmap_dbl(.l = list(ID, times),
.f = function(i, t, data) {
n <- data %>%
filter(ID == i) %>%
filter(between(as.double.difftime(times-t, units = "hours"),
0, 2)) %>%
nrow()
return(n)
}, data = .)) %>% ungroup()
}
fun_new <- function(df) {
df |>
group_by(ID) |>
mutate(n_obs_with_2h = purrr::map_dbl(times, ~ sum(difftime(times[ID == ID], ., units = "hours") <= 2 & difftime(times[ID == ID], ., units = "hours") >= 0))) |>
ungroup()
}
bench::mark(fun_original(df), fun_new(df))
# A tibble: 2 × 13
# expression min median `itr/sec` mem_alloc #`gc/sec` n_itr n_gc total_time result
# <bch:expr> <bch:tm> <bch:tm> <dbl> <bch:byt> <dbl> <int> <dbl> <bch:tm> <list>
# fun_original(df) 15.53ms 16.38ms 59.9 45.77KB 15.6 23 6 384ms <tibble>
# fun_new(df) 1.74ms 1.95ms 486. 6.02KB 10.8 224 5 461ms <tibble>