Home > Enterprise >  Merging two tables considering the range of dates from the second table using R
Merging two tables considering the range of dates from the second table using R

Time:01-08

I have two dataframes:

df_1 <- data.frame(DATES_1=c("12-30-2022", "03-02-2023","05-05-2023"),
                 DV01_1=c(1,2,3))
df_2 <- data.frame(DATES_2=c("01-01-2023", "02-01-2023", "05-01-2023","06-01-2023"),
                 DV01_2=c(1,2,3,4))

I want to join the DV01_1 values from df_1 into df_2 respecting the range of dates in DATES_2. the final answer should be this dataframe:

df_3 <- data.frame(DATES_2=c("01-01-2023", "02-01-2023", "05-01-2023","06-01-2023"),
                 DV01_2=c(1,2,3,4), DV01_1=c(0,0,2,3))

So for example, the last element from df_1 should consider that

01/01/23,02/01/23,05/01/23 < DATES_1 < 06/01/23. 

This process can be easily replicated using excel formula SUMIFS, for example.

CodePudding user response:

We may use a non-equi join - Convert the dates to date class in each of the datasets, then shift the DATES_2 to get the previous date as starting date in the range, do a join with the df_1 using a non-equi join to create the DV01_1 column in the df_2

library(data.table)
df_2$DATES_2 <- as.Date(df_2$DATES_2, "%m-%d-%Y")
df_1$DATES_1 <- as.Date(df_1$DATES_1, "%m-%d-%Y")
setDT(df_2)[, st := shift(DATES_2, fill = first(DATES_2))]
df_2[df_1, DV01_1 := DV01_1, 
      on = .(st <= DATES_1, DATES_2 > DATES_1), mult = "last"]
df_2[, st := NULL][is.na(DV01_1), DV01_1 := 0]

-output

> df_2
      DATES_2 DV01_2 DV01_1
1: 2023-01-01      1      0
2: 2023-02-01      2      0
3: 2023-05-01      3      2
4: 2023-06-01      4      3

OP's expected output

> df_3
     DATES_2 DV01_2 DV01_1
1 01-01-2023      1      0
2 02-01-2023      2      0
3 05-01-2023      3      2
4 06-01-2023      4      3

Or this can be done with the devel version of dplyr

library(dplyr)
library(lubridate)
library(tidyr)
df_1 %>% 
  mutate(DATES_1 = mdy(DATES_1)) %>% 
  left_join(df_2 %>% 
      mutate(DATES_2 = mdy(DATES_2),
              st = lag(DATES_2, default = first(DATES_2))), .,
    by = join_by(closest(st <= DATES_1), 
    closest(DATES_2 > DATES_1))) %>%
   select(names(df_2), DV01_1) %>%
  mutate(DV01_1 = replace_na(DV01_1, 0))

-output

      DATES_2 DV01_2 DV01_1
1 2023-01-01      1      0
2 2023-02-01      2      0
3 2023-05-01      3      2
4 2023-06-01      4      3
  • Related