I have a data table (lv_timest) with time stamps every 3 hours for each date:
# A tibble: 6 × 5
  LV0_mean LV1_mean LV2_mean Date_time           Date
     <dbl>    <dbl>    <dbl> <S3: POSIXct>       <date>
1    0.778    -4.12    0.736 2016-12-28 00:00:00 2016-12-28
2    0.376   -0.234    0.388 2016-12-28 03:00:00 2016-12-28
3    0.409     1.46    0.241 2016-12-28 06:00:00 2016-12-28
4    0.760     2.07    0.460 2016-12-28 09:00:00 2016-12-28
5    0.759     2.91    0.735 2016-12-28 12:00:00 2016-12-28
6    0.857     3.00    0.803 2016-12-28 15:00:00 2016-12-28
from which I would like to extract the time stamps that match as closely as possible those of another table (event_timest):
# A tibble: 6 × 4
  Event_number Date_time           Date       Date_time_new
         <int> <S3: POSIXct>       <date>     <S3: POSIXct>
1           75 2016-12-28 08:00:00 2016-12-28 2016-12-28 08:00:00
2          123 2016-12-30 14:02:00 2016-12-30 2016-12-30 14:00:00
3          264 2017-01-07 06:12:00 2017-01-07 2017-01-07 06:00:00
4          317 2017-01-09 10:59:00 2017-01-09 2017-01-09 11:00:00
5          318 2017-01-09 13:31:00 2017-01-09 2017-01-09 14:00:00
6          369 2017-01-11 07:24:00 2017-01-11 2017-01-11 07:00:00
For example, for row 1 in table event_timest, I would extract row 4 from table lv_timest:
Event_number Date_time.x         Date.x     Date_time_new       LV0_mean LV1_mean LV2_mean Date_time.y         Date.y
       <int> <S3: POSIXct>       <date>     <S3: POSIXct>          <dbl>    <dbl>    <dbl> <S3: POSIXct>       <date>
          75 2016-12-28 08:00:00 2016-12-28 2016-12-28 08:00:00    0.760     2.07    0.460 2016-12-28 09:00:00 2016-12-28
In fact, the time difference should not be more than one hour. I thought of using the fuzzyjoin package for this and writing a function that computes the time difference between the timestamps of the two tables in hours. However, fuzzy_inner_join replicates rows: the same lv_timest timestamp can be matched by several events, and the same event can be matched to several lv_timest rows (see rows 4-6 of the output below).
require(lubridate)
require(fuzzyjoin)
diff_timest <- function(x, y) { abs(x %--% y %/% hours(1)) <= 1 } # time interval as hours ≤ 1 hour
match_timest <- fuzzy_inner_join(event_timest, lv_timest,
                                 by = c("Date" = "Date",
                                        "Date_time_new" = "Date_time"),
                                 match_fun = list(`==`, diff_timest))
head(match_timest)
# A tibble: 6 × 9
  Event_number Date_time.x         Date.x     Date_time_new       LV0_mean LV1_mean LV2_mean Date_time.y         Date.y
         <int> <dttm>              <date>     <dttm>                 <dbl>    <dbl>    <dbl> <dttm>              <date>
1           75 2016-12-28 08:00:00 2016-12-28 2016-12-28 08:00:00    0.760     2.07    0.460 2016-12-28 09:00:00 2016-12-28
2          123 2016-12-30 14:02:00 2016-12-30 2016-12-30 14:00:00     1.24     1.83     2.05 2016-12-30 15:00:00 2016-12-30
3          264 2017-01-07 06:12:00 2017-01-07 2017-01-07 06:00:00   -0.128    -5.43     2.72 2017-01-07 06:00:00 2017-01-07
4          317 2017-01-09 10:59:00 2017-01-09 2017-01-09 11:00:00  -0.0751    0.171     2.56 2017-01-09 09:00:00 2017-01-09
5          317 2017-01-09 10:59:00 2017-01-09 2017-01-09 11:00:00   -0.204   -0.797     2.28 2017-01-09 12:00:00 2017-01-09
6          318 2017-01-09 13:31:00 2017-01-09 2017-01-09 14:00:00   -0.204   -0.797     2.28 2017-01-09 12:00:00 2017-01-09
Would there be another way to do this?
CodePudding user response:
Joining is essentially a two-step procedure: first build all combinations of rows, then filter. We can do this manually:
library(tidyverse)
library(lubridate)
#>
#> Attaching package: 'lubridate'
#> The following objects are masked from 'package:base':
#>
#> date, intersect, setdiff, union
datetimes_a <- tibble(
  id = seq(3),
  group = "A",
  datetime = c("2016-12-28 00:00:00", "2016-12-28 03:00:00", "2016-12-28 23:59:59") %>% as.POSIXct()
)
datetimes_b <- tibble(
  id = seq(3),
  group = "B",
  datetime = c("2016-12-28 00:00:10", "2016-12-28 03:20:00", "2016-12-29 00:00:02") %>% as.POSIXct()
)

datetimes_a %>%
  # start with cross product of all possible pairs
  expand_grid(datetimes_b %>% rename_all(~ paste0(.x, "_b"))) %>%
  mutate(diff = abs(datetime - datetime_b)) %>%
  # get shortest time difference
  group_by(id, id_b) %>%
  arrange(diff) %>%
  slice(1) %>%
  # time diff must be less than 1hr
  filter(diff < hours(1))
#> # A tibble: 3 x 7
#> # Groups: id, id_b [3]
#>      id group datetime             id_b group_b datetime_b               diff
#>   <int> <chr> <dttm>              <int> <chr>   <dttm>                 <drtn>
#> 1     1 A     2016-12-28 00:00:00     1 B       2016-12-28 00:00:10   10 secs
#> 2     2 A     2016-12-28 03:00:00     2 B       2016-12-28 03:20:00 1200 secs
#> 3     3 A     2016-12-28 23:59:59     3 B       2016-12-29 00:00:02    3 secs
Created on 2022-02-08 by the reprex package (v2.0.1)
This also works if the nearest time point falls on another date, e.g. right before and right after midnight.
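Applied to the tables from the question, the same approach might look roughly like this (an untested sketch that assumes the column names shown above and compares the raw event Date_time; swap in Date_time_new if you prefer the rounded value):
library(tidyverse)
event_timest %>%
  # cross product of all event rows and all lv rows ("_lv" suffix avoids name clashes)
  expand_grid(lv_timest %>% rename_all(~ paste0(.x, "_lv"))) %>%
  # absolute time difference in hours
  mutate(diff = abs(as.numeric(difftime(Date_time, Date_time_lv, units = "hours")))) %>%
  # for each event, keep the lv row with the shortest time difference
  group_by(Event_number) %>%
  arrange(diff) %>%
  slice(1) %>%
  ungroup() %>%
  # and keep it only if that difference is at most one hour
  filter(diff <= 1)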
CodePudding user response:
I would suggest a standard join, followed by a grouped filter to the closest instance of each timestamp:
library(tidyverse)
library(lubridate)
match_timest <- event_timest %>%
  inner_join(lv_timest, by = "Date") %>%
  # absolute time difference between the two timestamps, in hours
  mutate(diff = abs(as.numeric(Date_time.x - Date_time.y, units = "hours"))) %>%
  # within each lv timestamp, keep only the closest event, and only if it is within one hour
  group_by(Date_time.y) %>%
  filter(diff <= 1 & diff == min(diff)) %>%
  ungroup() %>%
  select(!diff)
Note:
- This will still match multiple rows if more than one row has exactly the same shortest difference from the index timestamp.
- This won't match timestamps from different dates, e.g. 23:59:59 on 1/1/22 won't be matched with 00:00:00 on 1/2/22. If you'd like to do that, you can use a full Cartesian join (full_join(lv_timest, by = character())) rather than the inner_join() above; see the sketch below.
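For reference, such a cross-join variant could look roughly like this (again an untested sketch assuming the column names from the question). It groups by Event_number so that each event keeps only its single closest lv_timest row, and slice_min() with with_ties = FALSE also breaks the exact ties mentioned in the first point; group by Date_time.y instead if you want the same grouping as the join above.
match_timest <- event_timest %>%
  # full Cartesian join: every event row paired with every lv row, regardless of Date
  full_join(lv_timest, by = character()) %>%
  mutate(diff = abs(as.numeric(Date_time.x - Date_time.y, units = "hours"))) %>%
  # one closest lv row per event, breaking exact ties arbitrarily
  group_by(Event_number) %>%
  slice_min(diff, n = 1, with_ties = FALSE) %>%
  ungroup() %>%
  filter(diff <= 1) %>%
  select(!diff)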