R: Compute number of rows within time interval


Let's assume the following data frame:

df <- tibble(ID = c(12, 12, 12, 13, 13, 13),
             times = c(as.POSIXct("2021-01-02 10:00:00"),
                       as.POSIXct("2021-01-02 11:00:00"),
                       as.POSIXct("2021-01-02 13:00:00"),
                       as.POSIXct("2021-01-02 13:00:00"),
                       as.POSIXct("2021-01-02 14:00:00"),
                       as.POSIXct("2021-01-02 15:00:00")))
# A tibble: 6 × 2
     ID times              
  <dbl> <dttm>             
1    12 2021-01-02 10:00:00
2    12 2021-01-02 11:00:00
3    12 2021-01-02 13:00:00
4    13 2021-01-02 13:00:00
5    13 2021-01-02 14:00:00
6    13 2021-01-02 15:00:00

What I want is a column that treats each timestamp of an ID as a start value and counts the observations of that ID (including the row itself) within the next 2 hours. So this is my goal (a small illustration of the definition follows the table):

     ID times               n_obs_within_2h
  <dbl> <dttm>                        <dbl>
1    12 2021-01-02 10:00:00               2
2    12 2021-01-02 11:00:00               2
3    12 2021-01-02 13:00:00               1
4    13 2021-01-02 13:00:00               3
5    13 2021-01-02 14:00:00               2
6    13 2021-01-02 15:00:00               1
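For one ID and one start time t, the desired count is simply the number of that ID's timestamps falling into [t, t + 2 hours]. For example, using the df defined above (POSIXct arithmetic works in seconds, so 2 hours = 7200 seconds):

t <- as.POSIXct("2021-01-02 10:00:00")   # first timestamp of ID 12
sum(df$times[df$ID == 12] >= t & df$times[df$ID == 12] <= t + 7200)
#> [1] 2

The question is how to get this count for every row without looping over the rows one by one.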

I know that this could easily be done with purrr::map by iterating over each row. However, my original dataset is quite big, which makes that approach fairly inefficient. Can you think of another way than iterating over each row to compute n_obs_within_2h?

EDIT: my current attempt:

df %>% group_by(ID) %>% 
  mutate(n_obs_with_2h = purrr::pmap_dbl(.l = list(ID, times), 
                                         .f = function(i, t, data) {
                                           n <- data %>%
                                             filter(ID == i) %>%
                                             filter(between(as.double.difftime(times-t, units = "hours"),
                                                            0, 2)) %>%
                                             nrow()
                                           return(n)
                                         }, data = .))

CodePudding user response:

Maybe a vectorised approach using a sliding window to count subsequent observations within the next 2 hours?

library(tidyverse)
library(lubridate)
library(slider)

df <- tibble(
  ID = c(12, 12, 12, 13, 13, 13),
  times = c(
    as.POSIXct("2021-01-02 10:00:00"),
    as.POSIXct("2021-01-02 11:00:00"),
    as.POSIXct("2021-01-02 13:00:00"),
    as.POSIXct("2021-01-02 13:00:00"),
    as.POSIXct("2021-01-02 14:00:00"),
    as.POSIXct("2021-01-02 15:00:00")
  )
)

df |>
  group_by(ID) |>
  mutate(
    diff = difftime(times, min(times), units = "hours"),
    within_2 = if_else(diff <= 2, 1, 0),
    n_obs_within_2h = slide_dbl(within_2, sum, .after = Inf)
  ) |> 
  ungroup()

#> # A tibble: 6 × 5
#>      ID times               diff    within_2 n_obs_within_2h
#>   <dbl> <dttm>              <drtn>     <dbl>           <dbl>
#> 1    12 2021-01-02 10:00:00 0 hours        1               2
#> 2    12 2021-01-02 11:00:00 1 hours        1               1
#> 3    12 2021-01-02 13:00:00 3 hours        0               0
#> 4    13 2021-01-02 13:00:00 0 hours        1               3
#> 5    13 2021-01-02 14:00:00 1 hours        1               2
#> 6    13 2021-01-02 15:00:00 2 hours        1               1

Created on 2022-06-30 by the reprex package (v2.0.1)
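Note that the window above is anchored at each ID's first timestamp, which is why the printed counts differ from the ones asked for in the question (compare rows 2 and 3). A sketch that instead anchors the 2-hour window at each row's own timestamp, assuming the times are already sorted within each ID, could use slider::slide_index_dbl() with a lubridate period as the look-ahead:

df |>
  group_by(ID) |>
  mutate(
    # count observations in [times, times + 2 hours] within each ID
    n_obs_within_2h = slide_index_dbl(times, times, length, .after = hours(2))
  ) |>
  ungroup()

On the example data this should reproduce the requested counts 2, 2, 1, 3, 2, 1.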

CodePudding user response:

Using another approach within the mapping is probably the key to better performance in this case. Instead of filtering the full data for every row, we can make use of the grouped structure itself, like this:

df |>
  group_by(ID) |>
  mutate(n_obs_with_2h = purrr::map_dbl(
    times,
    ~ sum(difftime(times[ID == ID], ., units = "hours") <= 2 &
            difftime(times[ID == ID], ., units = "hours") >= 0))) |>
  ungroup()

# A tibble: 6 × 3
#     ID times               n_obs_with_2h
#  <dbl> <dttm>                      <dbl>
#     12 2021-01-02 10:00:00             2
#     12 2021-01-02 11:00:00             2
#     12 2021-01-02 13:00:00             1
#     13 2021-01-02 13:00:00             3
#     13 2021-01-02 14:00:00             2
#     13 2021-01-02 15:00:00             1
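A slightly leaner sketch of the same idea (my variant, not part of the benchmark below) computes the time difference only once per start value:

df |>
  group_by(ID) |>
  mutate(n_obs_with_2h = purrr::map_dbl(times, function(t) {
    d <- difftime(times, t, units = "hours")  # differences to all times of this ID
    sum(d >= 0 & d <= 2)                      # keep those within the next 2 hours
  })) |>
  ungroup()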

See the benchmark below (even though the provided data is too small for it to be very reliable; that being said, I would expect the gap to be even larger on a bigger set):

fun_original <- function(df) {
  
  df %>% group_by(ID) %>% 
    mutate(n_obs_with_2h = purrr::pmap_dbl(.l = list(ID, times), 
                                           .f = function(i, t, data) {
                                             n <- data %>%
                                               filter(ID == i) %>%
                                               filter(between(as.double.difftime(times-t, units = "hours"),
                                                              0, 2)) %>%
                                               nrow()
                                             return(n)
                                           }, data = .)) %>% ungroup()
  
}

fun_new <- function(df) {
  
  df |>
    group_by(ID) |>
    mutate(n_obs_with_2h = purrr::map_dbl(
      times,
      ~ sum(difftime(times[ID == ID], ., units = "hours") <= 2 &
              difftime(times[ID == ID], ., units = "hours") >= 0))) |>
    ungroup()
  
}

bench::mark(fun_original(df), fun_new(df))

# A tibble: 2 × 13
#  expression            min   median `itr/sec` mem_alloc `gc/sec` n_itr  n_gc total_time result  
#  <bch:expr>       <bch:tm> <bch:tm>     <dbl> <bch:byt>    <dbl> <int> <dbl>   <bch:tm> <list>  
#  fun_original(df)  15.53ms  16.38ms      59.9   45.77KB     15.6    23     6      384ms <tibble>
#  fun_new(df)        1.74ms   1.95ms     486.     6.02KB     10.8   224     5      461ms <tibble>
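
If the data gets really large, a fully vectorised sketch that avoids per-row mapping altogether could use findInterval() on the timestamps (assumptions here: times are sorted and contain no duplicates within each ID; POSIXct arithmetic is in seconds, so 2 hours = 7200 seconds):

df |>
  group_by(ID) |>
  mutate(
    # number of times <= t + 2h, minus the number of times strictly before t
    n_obs_within_2h = findInterval(times + 7200, times) - (seq_along(times) - 1)
  ) |>
  ungroup()

On the example data this also yields 2, 2, 1, 3, 2, 1, but it is only a sketch: duplicated timestamps within an ID would break the seq_along() part.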