How to join tibbles/data frames with different numbers of rows by ID and date/time interval?

I have these two datasets exemplified below:

library(lubridate)
library(tidyverse)

#dataset 1

id <- c("A_1", "A_1", "A_1", "A_1", "A_1", "A_2", "A_2", "A_2", "A_2", 
        "A_2", "B_1", "B_1", "B_1", "B_1", "B_1", "B_2", "B_2", "B_2", "B_2", 
        "B_2")
date <- ymd_hms(c("2017-11-26 09:00:00", "2017-11-26 09:05:00", "2017-11-30 09:00:00", 
                  "2017-11-30 09:05:00", "2017-12-02 09:00:00", "2017-11-26 09:00:00", "2017-11-26 09:05:00", "2017-11-30 09:00:00", 
                  "2017-11-30 09:05:00", "2017-12-02 09:00:00", "2017-11-26 09:00:00", "2017-11-26 09:05:00", "2017-11-30 09:00:00", 
                  "2017-11-30 09:05:00", "2017-12-02 09:00:00", "2017-11-26 09:00:00", "2017-11-26 09:05:00", "2017-11-30 09:00:00", 
                  "2017-11-30 09:05:00", "2017-12-02 09:00:00"))    

df <- tibble(id, date)

# A tibble: 20 x 2
   id    date               
   <chr> <dttm>             
 1 A_1   2017-11-26 09:00:00
 2 A_1   2017-11-26 09:05:00
 3 A_1   2017-11-30 09:00:00
 4 A_1   2017-11-30 09:05:00
 5 A_1   2017-12-02 09:00:00
 6 A_2   2017-11-26 09:00:00
 7 A_2   2017-11-26 09:05:00
 8 A_2   2017-11-30 09:00:00
 9 A_2   2017-11-30 09:05:00
10 A_2   2017-12-02 09:00:00
11 B_1   2017-11-26 09:00:00
12 B_1   2017-11-26 09:05:00
13 B_1   2017-11-30 09:00:00
14 B_1   2017-11-30 09:05:00
15 B_1   2017-12-02 09:00:00
16 B_2   2017-11-26 09:00:00
17 B_2   2017-11-26 09:05:00
18 B_2   2017-11-30 09:00:00
19 B_2   2017-11-30 09:05:00
20 B_2   2017-12-02 09:00:00

#dataset 2

id <- c("A", "A", "B", "B")
date <- ymd_hms(c("2017-11-26 09:01:30", "2017-11-30 09:06:40", "2017-11-30 09:04:50", "2017-12-02 09:01:00"))
variable1 <- c("67", "30", "28", "90")
variable2 <- c("x","y","z", "w")
df2 <- tibble(id, date, variable1, variable2)

# A tibble: 4 x 4
  id    date                variable1 variable2
  <chr> <dttm>              <chr>     <chr>    
1 A     2017-11-26 09:01:30 67        x        
2 A     2017-11-30 09:06:40 30        y        
3 B     2017-11-30 09:04:50 28        z        
4 B     2017-12-02 09:01:00 90        w

I first need to group by "id", then by date and time, and then pull the columns of dataset 2 into dataset 1 as new columns, attaching each dataset 2 row to the nearest earlier time in dataset 1 (condition: the dataset 1 time may be at most 5 minutes before the dataset 2 time).

However, each "id" in dataset 2 occurs 50 times in dataset 1 (in the example above, "A" appears as "A_1" and "A_2"), so a row in dataset 2 will probably find a corresponding time 50 times in dataset 1 for the same date. For each "id", I need this extraction to be repeated for every corresponding time, however many there are.

The resulting dataset would look like this:

df_output
# A tibble: 20 x 5
   id    date                date2               variable1 variable2
   <chr> <dttm>              <chr>               <chr>     <chr>    
 1 A_1   2017-11-26 09:00:00 2017-11-26 09:01:30 67        x        
 2 A_1   2017-11-26 09:05:00 NA                  NA        NA       
 3 A_1   2017-11-30 09:00:00 NA                  NA        NA       
 4 A_1   2017-11-30 09:05:00 2017-11-30 09:06:40 30        y        
 5 A_1   2017-12-02 09:00:00 NA                  NA        NA       
 6 A_2   2017-11-26 09:00:00 2017-11-26 09:01:30 67        x        
 7 A_2   2017-11-26 09:05:00 NA                  NA        NA       
 8 A_2   2017-11-30 09:00:00 NA                  NA        NA       
 9 A_2   2017-11-30 09:05:00 2017-11-30 09:06:40 30        y        
10 A_2   2017-12-02 09:00:00 NA                  NA        NA       
11 B_1   2017-11-26 09:00:00 NA                  NA        NA       
12 B_1   2017-11-26 09:05:00 NA                  NA        NA       
13 B_1   2017-11-30 09:00:00 2017-11-30 09:04:50 28        z        
14 B_1   2017-11-30 09:05:00 NA                  NA        NA       
15 B_1   2017-12-02 09:00:00 2017-12-02 09:01:00 90        w        
16 B_2   2017-11-26 09:00:00 NA                  NA        NA       
17 B_2   2017-11-26 09:05:00 NA                  NA        NA       
18 B_2   2017-11-30 09:00:00 2017-11-30 09:04:50 28        z        
19 B_2   2017-11-30 09:05:00 NA                  NA        NA       
20 B_2   2017-12-02 09:00:00 2017-12-02 09:01:00 90        w

Note: not every row in dataset 1 will have a corresponding row in dataset 2, so those rows must be filled with NA.

Thanks in advance.

CodePudding user response:

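We can use floor_date and ceiling_date from lubridate to bracket each date in df2 with its 5-minute interval, strip the numeric suffix from the id in df so that it matches the id in df2, and then do a non-equi join with data.table, which updates df by reference.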

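library(lubridate)
library(dplyr)
library(data.table)
# Bracket each df2 timestamp with its 5-minute bin; keep the original time as date2
df2new <- df2 %>%
   mutate(date2 = date,
          bin_start = floor_date(date, "5 min"),
          bin_end = ceiling_date(date, "5 min", change_on_boundary = TRUE))
# Strip the "_1"/"_2" suffix so that "A_1" and "A_2" both match id "A" in df2,
# then update df by reference where its date falls in [bin_start, bin_end);
# rows of df without a match keep NA in the new columns
setDT(df)[, id2 := sub("_\\d+$", "", id)][
   setDT(df2new), c('date2', 'variable1', 'variable2') := .(i.date2,
    i.variable1, i.variable2), on = .(id2 = id, date >= bin_start, date < bin_end)]
df[, id2 := NULL]  # drop the helper column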
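
For readers who would rather stay in dplyr, the same interval match can be sketched with an inequality join. This is only a sketch under a few assumptions: it needs dplyr 1.1.0 or later (where join_by() was introduced), the names bins, bin_start, bin_end and id2 are made up for illustration, and the times in dataset 1 are assumed to sit on the 5-minute grid, as they do in the example.

```r
library(dplyr)
library(lubridate)

# Rebuild the example data from the question
df <- tibble(
  id = rep(c("A_1", "A_2", "B_1", "B_2"), each = 5),
  date = ymd_hms(rep(c("2017-11-26 09:00:00", "2017-11-26 09:05:00",
                       "2017-11-30 09:00:00", "2017-11-30 09:05:00",
                       "2017-12-02 09:00:00"), times = 4))
)
df2 <- tibble(
  id = c("A", "A", "B", "B"),
  date = ymd_hms(c("2017-11-26 09:01:30", "2017-11-30 09:06:40",
                   "2017-11-30 09:04:50", "2017-12-02 09:01:00")),
  variable1 = c("67", "30", "28", "90"),
  variable2 = c("x", "y", "z", "w")
)

# One row per df2 observation, with the 5-minute bin that contains it.
# (bin_start, bin_end and id2 are illustrative helper names.)
bins <- df2 %>%
  transmute(id2 = id,
            date2 = date,
            bin_start = floor_date(date, "5 min"),
            bin_end = ceiling_date(date, "5 min", change_on_boundary = TRUE),
            variable1, variable2)

# Left join keeps every df row; rows with no df2 observation in their bin get NA
df_output <- df %>%
  mutate(id2 = sub("_\\d+$", "", id)) %>%   # "A_1" -> "A"
  left_join(bins, by = join_by(id2, date >= bin_start, date < bin_end)) %>%
  select(id, date, date2, variable1, variable2)

print(df_output, n = 20)
```

Because this is a left join, df_output keeps all 20 rows of df, and every "A_*" or "B_*" series picks up the same df2 row whenever its time falls in the matching bin, which is the repeated extraction the question asks for.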