Home > Blockchain >  R- merge dataframes based on recent dates
R- merge dataframes based on recent dates

Time:11-25

I have two dataframes:

In DF1, for every ID, the param have been recorded on various dates. In DF2, for every ID, a number of dates are given. For every ID, I would like to get all the corresponding param and value from DF1, depending on the dates: either the value that corresponds to the most recent date1 (in DF1) before date2 (in DF2) for a given param or If there is no such date1, the most recent value after date2.

DF1 is (I have marked with * the correct rows for the result ):

  ID      date1 param  value
1 id1   1/1/2020    pA pA_1_1
2 id1   2/1/2020    pA pA_1_2 *
3 id1  17/1/2020    pA pA_1_3
4 id1  20/1/2020    pB pB_1_1 *
5 id1  21/1/2020    pB pB_1_2
6 id2 21/12/2022    pA pA_2_1 *
7 id2 22/12/2022    pA pA_2_2 
8 id2 18/12/2022    pB pB_2_1 *
9 id2 19/12/2022    pB pB_2_2 

DF2 is:

   ID      date2
1 id1  15/1/2020
2 id2 20/12/2020

The result should be:

   ID      date2 param  value      date1
1 id1  15/1/2020    pA pA_1_2   2/1/2020
2 id1  15/1/2020    pB pB_1_1  20/1/2020
3 id2 20/12/2020    pA pA_2_1 21/12/2022
4 id2 20/12/2020    pB pB_2_1 18/12/2022

Code to reproduce the DF1 and DF2:

DF1= data.frame(
  stringsAsFactors = FALSE,
                ID = c("id1","id1","id1","id1",
                       "id1","id2","id2","id2","id2"),
             date1 = c("1/1/2020","2/1/2020",
                       "17/1/2020","20/1/2020","21/1/2020","21/12/2022",
                       "22/12/2022","18/12/2022","19/12/2022"),
             param = c("pA", "pA", "pA", "pB", "pB", "pA", "pA", "pB", "pB"),
             value = c("pA_1_1","pA_1_2","pA_1_3",
                       "pB_1_1","pB_1_2","pA_2_1","pA_2_2","pB_2_1","pB_2_2")
)

DF2=data.frame(
  stringsAsFactors = FALSE,
                ID = c("id1", "id2"),
             date2 = c("15/1/2020", "20/12/2020")
)

CodePudding user response:

This is my solution. I'm sure there is a way to write this with less code (using one dataframe instead of two and later merging). But I don't know righ now.

library(tidyverse)
library(lubridate)
# Get before date2
before <-  DF1 %>%
  left_join(DF2,by = "ID") %>% 
  mutate(diff = dmy(date1)-dmy(date2)) %>% 
  mutate(Grp = data.table::rleid(param)) %>%
  filter(diff < 0) %>%
  group_by(Grp) %>%
  filter(diff == max(diff)) %>% 
  ungroup
# Get after date2
after <- DF1 %>%
  left_join(DF2,by = "ID") %>% 
  mutate(diff = dmy(date1)-dmy(date2)) %>% 
  mutate(Grp = data.table::rleid(param)) %>%
  filter(diff > 0) %>%
  group_by(Grp) %>%
  filter(! Grp %in% before$Grp, diff == min(diff)) %>% 
  ungroup

result <- bind_rows(before,after) %>% 
  select(ID,date2, param, value, date1) %>%
  arrange(ID, param)
 

Explanation: I'm using lubridate library to compare the dates. I do the same process to create two dataframes - first one (before df) for groups which accomplish first condition (closest date in DF1 before date2 in DF2), second one (after df) is for groups which do the other way round (closest date in DF1 after date2 in DF2).

I will explain first:

# Get before date2

    before <-  DF1 %>%
    left_join(DF2,by = "ID") %>% 
    mutate(diff = dmy(date1)-dmy(date2)) %>% 
    mutate(Grp = data.table::rleid(param)) %>%
    filter(diff < 0) %>%
    group_by(Grp) %>%
    filter(diff == max(diff)) %>% 
    ungroup

Here, we merge DF1 and DF2 by ID, so rows with same ID have the same date2. Then, we calculate the differences date1-date2 - first we convert characters to date using dmy(). Therefore, dates before date2 will be a negative difference. With data.table::rleid(param) we enumerate subgroups with different ID & param, so we can know the subgroups. Then we can group by then and filter by them.

At the end:

result <- bind_rows(before,after) %>% 
  select(ID,date2, param, value, date1) %>%
  arrange(ID, param)

We bind the two dataframe by rows and select the columns you are looking for, to delete the columns we created to operate with (group and filter). PS: I added arrange() to make sure the final df is sorted by ID and param values.

  • Related