I have a data table (lv_timest) with time stamps every 3 hours for each date:
# A tibble: 6 × 5
  LV0_mean LV1_mean LV2_mean Date_time           Date
     <dbl>    <dbl>    <dbl> <S3: POSIXct>       <date>
1    0.778    -4.12    0.736 2016-12-28 00:00:00 2016-12-28
2    0.376   -0.234    0.388 2016-12-28 03:00:00 2016-12-28
3    0.409     1.46    0.241 2016-12-28 06:00:00 2016-12-28
4    0.760     2.07    0.460 2016-12-28 09:00:00 2016-12-28
5    0.759     2.91    0.735 2016-12-28 12:00:00 2016-12-28
6    0.857     3.00    0.803 2016-12-28 15:00:00 2016-12-28
from which I would like to extract the time stamps that match as closely as possible those of another table (event_timest):
# A tibble: 6 × 4
  Event_number Date_time           Date       Date_time_new
         <int> <S3: POSIXct>       <date>     <S3: POSIXct>
1           75 2016-12-28 08:00:00 2016-12-28 2016-12-28 08:00:00
2          123 2016-12-30 14:02:00 2016-12-30 2016-12-30 14:00:00
3          264 2017-01-07 06:12:00 2017-01-07 2017-01-07 06:00:00
4          317 2017-01-09 10:59:00 2017-01-09 2017-01-09 11:00:00
5          318 2017-01-09 13:31:00 2017-01-09 2017-01-09 14:00:00
6          369 2017-01-11 07:24:00 2017-01-11 2017-01-11 07:00:00
For example, for row 1 in table event_timest, I would extract row 4 from table lv_timest:
Event_number Date_time.x         Date.x     Date_time_new       LV0_mean LV1_mean LV2_mean Date_time.y         Date.y
       <int> <S3: POSIXct>       <date>     <S3: POSIXct>          <dbl>    <dbl>    <dbl> <S3: POSIXct>       <date>
          75 2016-12-28 08:00:00 2016-12-28 2016-12-28 08:00:00    0.760     2.07    0.460 2016-12-28 09:00:00 2016-12-28
In fact, the time difference should not be more than one hour. I thought of using the fuzzyjoin package for this and writing a function that computes the time difference between the timestamps of the two tables in hours. However, fuzzy_inner_join replicates rows: the same lv_timest timestamp can be matched by several events, and the same event can be matched to several lv_timest rows (see rows 4-6 of the output below).
require(lubridate)
require(fuzzyjoin)
diff_timest <- function(x, y) { abs(x %--% y %/% hours(1)) <= 1 } # time interval as hours ≤ 1 hour
match_timest <- fuzzy_inner_join(event_timest, lv_timest,
                                 by = c("Date" = "Date",
                                        "Date_time_new" = "Date_time"),
                                 match_fun = list(`==`, diff_timest))
head(match_timest)
# A tibble: 6 × 9
  Event_number Date_time.x         Date.x     Date_time_new       LV0_mean LV1_mean LV2_mean Date_time.y         Date.y
         <int> <dttm>              <date>     <dttm>                 <dbl>    <dbl>    <dbl> <dttm>              <date>
1           75 2016-12-28 08:00:00 2016-12-28 2016-12-28 08:00:00    0.760     2.07    0.460 2016-12-28 09:00:00 2016-12-28
2          123 2016-12-30 14:02:00 2016-12-30 2016-12-30 14:00:00     1.24     1.83     2.05 2016-12-30 15:00:00 2016-12-30
3          264 2017-01-07 06:12:00 2017-01-07 2017-01-07 06:00:00   -0.128    -5.43     2.72 2017-01-07 06:00:00 2017-01-07
4          317 2017-01-09 10:59:00 2017-01-09 2017-01-09 11:00:00  -0.0751    0.171     2.56 2017-01-09 09:00:00 2017-01-09
5          317 2017-01-09 10:59:00 2017-01-09 2017-01-09 11:00:00   -0.204   -0.797     2.28 2017-01-09 12:00:00 2017-01-09
6          318 2017-01-09 13:31:00 2017-01-09 2017-01-09 14:00:00   -0.204   -0.797     2.28 2017-01-09 12:00:00 2017-01-09
Would there be another way to do this?
CodePudding user response:
Joining is essentially a two-step procedure: first build all combinations of rows, then filter. We can do this manually:
library(tidyverse)
library(lubridate)
#>
#> Attaching package: 'lubridate'
#> The following objects are masked from 'package:base':
#>
#> date, intersect, setdiff, union
datetimes_a <- tibble(
  id = seq(3),
  group = "A",
  datetime = c("2016-12-28 00:00:00", "2016-12-28 03:00:00", "2016-12-28 23:59:59") %>% as.POSIXct()
)
datetimes_b <- tibble(
  id = seq(3),
  group = "B",
  datetime = c("2016-12-28 00:00:10", "2016-12-28 03:20:00", "2016-12-29 00:00:02") %>% as.POSIXct()
)

datetimes_a %>%
  # start with cross product of all possible pairs
  expand_grid(datetimes_b %>% rename_all(~ paste0(.x, "_b"))) %>%
  mutate(diff = abs(datetime - datetime_b)) %>%
  # get shortest time difference
  group_by(id, id_b) %>%
  arrange(diff) %>%
  slice(1) %>%
  # time diff must be less than 1hr
  filter(diff < hours(1))
#> # A tibble: 3 x 7
#> # Groups: id, id_b [3]
#>      id group datetime             id_b group_b datetime_b               diff
#>   <int> <chr> <dttm>              <int> <chr>   <dttm>                 <drtn>
#> 1     1 A     2016-12-28 00:00:00     1 B       2016-12-28 00:00:10   10 secs
#> 2     2 A     2016-12-28 03:00:00     2 B       2016-12-28 03:20:00 1200 secs
#> 3     3 A     2016-12-28 23:59:59     3 B       2016-12-29 00:00:02    3 secs
Created on 2022-02-08 by the reprex package (v2.0.1)
This also works if the nearest time point falls on another date, e.g. right before and right after midnight.
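Applied to the tables from the question, the same approach might look roughly like this (an untested sketch that assumes the column names shown above and compares the raw event Date_time; swap in Date_time_new if you prefer the rounded value):
library(tidyverse)
event_timest %>%
  # cross product of all event rows and all lv rows ("_lv" suffix avoids name clashes)
  expand_grid(lv_timest %>% rename_all(~ paste0(.x, "_lv"))) %>%
  # absolute time difference in hours
  mutate(diff = abs(as.numeric(difftime(Date_time, Date_time_lv, units = "hours")))) %>%
  # for each event, keep the lv row with the shortest time difference
  group_by(Event_number) %>%
  arrange(diff) %>%
  slice(1) %>%
  ungroup() %>%
  # and keep it only if that difference is at most one hour
  filter(diff <= 1)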
CodePudding user response:
I would suggest a standard join, followed by a grouped filter to the closest instance of each timestamp:
library(tidyverse)
library(lubridate)
match_timest <- event_timest %>%
  inner_join(lv_timest, by = "Date") %>%
  # absolute time difference between the two timestamps, in hours
  mutate(diff = abs(as.numeric(Date_time.x - Date_time.y, units = "hours"))) %>%
  # within each lv timestamp, keep only the closest event, and only if it is within one hour
  group_by(Date_time.y) %>%
  filter(diff <= 1 & diff == min(diff)) %>%
  ungroup() %>%
  select(!diff)
Note:
- This will still match multiple rows if more than one row has exactly the same shortest difference from the index timestamp.
- This won't match timestamps from different dates, e.g. 23:59:59 on 1/1/22 won't be matched with 00:00:00 on 1/2/22. If you'd like to do that, you can use a full Cartesian join (full_join(lv_timest, by = character())) rather than the inner_join() above; see the sketch below.
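For reference, such a cross-join variant could look roughly like this (again an untested sketch assuming the column names from the question). It groups by Event_number so that each event keeps only its single closest lv_timest row, and slice_min() with with_ties = FALSE also breaks the exact ties mentioned in the first point; group by Date_time.y instead if you want the same grouping as the join above.
match_timest <- event_timest %>%
  # full Cartesian join: every event row paired with every lv row, regardless of Date
  full_join(lv_timest, by = character()) %>%
  mutate(diff = abs(as.numeric(Date_time.x - Date_time.y, units = "hours"))) %>%
  # one closest lv row per event, breaking exact ties arbitrarily
  group_by(Event_number) %>%
  slice_min(diff, n = 1, with_ties = FALSE) %>%
  ungroup() %>%
  filter(diff <= 1) %>%
  select(!diff)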