I have two dataframes:
In DF1, for every ID
, the param
have been recorded on various dates.
In DF2, for every ID
, a number of dates are given.
For every ID
, I would like to get all the corresponding param
and value
from DF1, depending on the dates:
either the value
that corresponds to the most recent date1
(in DF1
) before date2
(in DF2
) for a given param or
If there is no such date1
, the most recent value
after date2
.
DF1
is (I have marked with * the correct rows for the result ):
ID date1 param value
1 id1 1/1/2020 pA pA_1_1
2 id1 2/1/2020 pA pA_1_2 *
3 id1 17/1/2020 pA pA_1_3
4 id1 20/1/2020 pB pB_1_1 *
5 id1 21/1/2020 pB pB_1_2
6 id2 21/12/2022 pA pA_2_1 *
7 id2 22/12/2022 pA pA_2_2
8 id2 18/12/2022 pB pB_2_1 *
9 id2 19/12/2022 pB pB_2_2
DF2
is:
ID date2
1 id1 15/1/2020
2 id2 20/12/2020
The result should be:
ID date2 param value date1
1 id1 15/1/2020 pA pA_1_2 2/1/2020
2 id1 15/1/2020 pB pB_1_1 20/1/2020
3 id2 20/12/2020 pA pA_2_1 21/12/2022
4 id2 20/12/2020 pB pB_2_1 18/12/2022
Code to reproduce the DF1
and DF2
:
DF1= data.frame(
stringsAsFactors = FALSE,
ID = c("id1","id1","id1","id1",
"id1","id2","id2","id2","id2"),
date1 = c("1/1/2020","2/1/2020",
"17/1/2020","20/1/2020","21/1/2020","21/12/2022",
"22/12/2022","18/12/2022","19/12/2022"),
param = c("pA", "pA", "pA", "pB", "pB", "pA", "pA", "pB", "pB"),
value = c("pA_1_1","pA_1_2","pA_1_3",
"pB_1_1","pB_1_2","pA_2_1","pA_2_2","pB_2_1","pB_2_2")
)
DF2=data.frame(
stringsAsFactors = FALSE,
ID = c("id1", "id2"),
date2 = c("15/1/2020", "20/12/2020")
)
CodePudding user response:
This is my solution. I'm sure there is a way to write this with less code (using one dataframe instead of two and later merging). But I don't know righ now.
library(tidyverse)
library(lubridate)
# Get before date2
before <- DF1 %>%
left_join(DF2,by = "ID") %>%
mutate(diff = dmy(date1)-dmy(date2)) %>%
mutate(Grp = data.table::rleid(param)) %>%
filter(diff < 0) %>%
group_by(Grp) %>%
filter(diff == max(diff)) %>%
ungroup
# Get after date2
after <- DF1 %>%
left_join(DF2,by = "ID") %>%
mutate(diff = dmy(date1)-dmy(date2)) %>%
mutate(Grp = data.table::rleid(param)) %>%
filter(diff > 0) %>%
group_by(Grp) %>%
filter(! Grp %in% before$Grp, diff == min(diff)) %>%
ungroup
result <- bind_rows(before,after) %>%
select(ID,date2, param, value, date1) %>%
arrange(ID, param)
Explanation: I'm using lubridate library to compare the dates. I do the same process to create two dataframes - first one (before df) for groups which accomplish first condition (closest date in DF1 before date2 in DF2), second one (after df) is for groups which do the other way round (closest date in DF1 after date2 in DF2).
I will explain first:
# Get before date2
before <- DF1 %>%
left_join(DF2,by = "ID") %>%
mutate(diff = dmy(date1)-dmy(date2)) %>%
mutate(Grp = data.table::rleid(param)) %>%
filter(diff < 0) %>%
group_by(Grp) %>%
filter(diff == max(diff)) %>%
ungroup
Here, we merge DF1 and DF2 by ID, so rows with same ID have the same date2. Then, we calculate the differences date1-date2 - first we convert characters to date using dmy()
. Therefore, dates before date2 will be a negative difference. With data.table::rleid(param)
we enumerate subgroups with different ID & param, so we can know the subgroups. Then we can group by then and filter by them.
At the end:
result <- bind_rows(before,after) %>%
select(ID,date2, param, value, date1) %>%
arrange(ID, param)
We bind the two dataframe by rows and select the columns you are looking for, to delete the columns we created to operate with (group and filter). PS: I added arrange() to make sure the final df is sorted by ID and param values.