Have a tricky query I'm trying to run on a data set (~200k rows) with lots of date fields.
A pared-down reprex would be like this:
df <- data.frame(index = c(1, 1, 1, 1, 2, 2, 2, 2, 2, 2, 3, 3),
effective_date = as.Date(c("2020-01-01", "2020-06-13", "2021-01-01", "2022-01-01", "2023-01-01", "2024-01-01", "2024-04-13", "2025-01-01", "2025-03-01", "2026-01-01", "2015-01-01", "2016-01-01")),
end_date = as.Date(c("2021-01-01", NA, "2022-01-01", "2023-01-01", "2024-01-01", "2025-01-01", NA, "2026-01-01", NA, "2027-01-01", "2016-01-01", "2017-01-01")),
type = c("normal", "SPECIAL", "normal", "normal", "normal", "normal", "SPECIAL", "normal", "SPECIAL", "normal", "normal", "normal"))
Logic as follows: the end_date for all records where type = "SPECIAL" is unknown and needs to be calculated. For each record where type = "SPECIAL", retrieve the end_date values of the "normal" records for that particular index value, then pick the earliest end_date from that list which is still >= the effective_date of the "SPECIAL" record.
This would give the following results:
df_2 <- data.frame(index = c(1, 1, 1, 1, 2, 2, 2, 2, 2, 2, 3, 3),
effective_date = as.Date(c("2020-01-01", "2020-06-13", "2021-01-01", "2022-01-01", "2023-01-01", "2024-01-01", "2024-04-13", "2025-01-01", "2025-03-01", "2026-01-01", "2015-01-01", "2016-01-01")),
end_date = as.Date(c("2021-01-01", NA, "2022-01-01", "2023-01-01", "2024-01-01", "2025-01-01", NA, "2026-01-01", NA, "2027-01-01", "2016-01-01", "2017-01-01")),
type = c("normal", "SPECIAL", "normal", "normal", "normal", "normal", "SPECIAL", "normal", "SPECIAL", "normal", "normal", "normal"),
calc_end_date = as.Date(c(NA, "2021-01-01", NA, NA, NA, NA, "2025-01-01", NA, "2026-01-01", NA, NA, NA)))
(constructed this dummy data by hand and I'm tired - I hope I didn't make any mistakes!)
I'm hoping there's a simple way to do this using dplyr-type functions (I like dplyr), but I'm not an expert and I'm struggling to see how to do this.
Thank you.
CodePudding user response:
Using dplyr 1.0.99.9000 (the development version of what became dplyr 1.1.0, which added join_by() and the closest() helper), we can take the SPECIAL rows and join each one to the closest non-SPECIAL row with the same index and an equal or later end_date.
devtools::install_github("tidyverse/dplyr")
left_join(df %>% filter(type == "SPECIAL"),
          df %>% filter(type != "SPECIAL"),
          by = join_by(index, closest(effective_date <= end_date))) %>%
  select(index, effective_date = effective_date.x,
         calc_end_date = end_date.y) -> updates

df %>%
  left_join(updates)
Result
Joining with `by = join_by(index, effective_date)`
index effective_date end_date type calc_end_date
1 1 2020-01-01 2021-01-01 normal <NA>
2 1 2020-06-13 <NA> SPECIAL 2021-01-01
3 1 2021-01-01 2022-01-01 normal <NA>
4 1 2022-01-01 2023-01-01 normal <NA>
5 2 2023-01-01 2024-01-01 normal <NA>
6 2 2024-01-01 2025-01-01 normal <NA>
7 2 2024-04-13 <NA> SPECIAL 2025-01-01
8 2 2025-01-01 2026-01-01 normal <NA>
9 2 2025-03-01 <NA> SPECIAL 2026-01-01
10 2 2026-01-01 2027-01-01 normal <NA>
11 3 2015-01-01 2016-01-01 normal <NA>
12 3 2016-01-01 2017-01-01 normal <NA>
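As a side note, if the end goal is a single filled-in end_date column rather than a separate calc_end_date, the looked-up values can be folded back in afterwards. A minimal base-R sketch with a tiny stand-in for the joined result (with dplyr loaded, mutate(end_date = coalesce(end_date, calc_end_date)) does the same thing):

```r
# Tiny stand-in for the joined result: fill end_date from calc_end_date
# wherever end_date is missing and a calculated value exists.
result <- data.frame(
  end_date      = as.Date(c("2021-01-01", NA, "2022-01-01")),
  calc_end_date = as.Date(c(NA, "2021-01-01", NA))
)
fill <- is.na(result$end_date) & !is.na(result$calc_end_date)
result$end_date[fill] <- result$calc_end_date[fill]
```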
The "updates" table could be created less efficiently with dplyr < 1.1 using a Cartesian join followed by filtering. This is fine for small data, but the intermediate result can blow up if there are many observations per index.
left_join(df %>% filter(type == "SPECIAL"),
          df %>% filter(type != "SPECIAL"),
          by = "index") %>%
  arrange(index, effective_date.x, end_date.y) %>%
  group_by(index, effective_date.x) %>%
  filter(end_date.y >= effective_date.x) %>%
  slice(1) %>%
  select(index, effective_date = effective_date.x, calc_end_date = end_date.y) %>%
  ungroup() -> updates
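And if neither dplyr version is an option, the same lookup can be sketched in base R with a loop over just the SPECIAL rows (fine at ~200k rows as long as the SPECIAL subset isn't huge; uses the reprex data from the question):

```r
# Base-R sketch of the same logic: for each SPECIAL row, find the
# earliest "normal" end_date (same index) that is >= its effective_date.
df <- data.frame(
  index = c(1, 1, 1, 1, 2, 2, 2, 2, 2, 2, 3, 3),
  effective_date = as.Date(c("2020-01-01", "2020-06-13", "2021-01-01",
                             "2022-01-01", "2023-01-01", "2024-01-01",
                             "2024-04-13", "2025-01-01", "2025-03-01",
                             "2026-01-01", "2015-01-01", "2016-01-01")),
  end_date = as.Date(c("2021-01-01", NA, "2022-01-01", "2023-01-01",
                       "2024-01-01", "2025-01-01", NA, "2026-01-01",
                       NA, "2027-01-01", "2016-01-01", "2017-01-01")),
  type = c("normal", "SPECIAL", "normal", "normal", "normal", "normal",
           "SPECIAL", "normal", "SPECIAL", "normal", "normal", "normal")
)

df$calc_end_date <- as.Date(NA)
for (i in which(df$type == "SPECIAL")) {
  # Candidate end dates: "normal" rows sharing this row's index.
  candidates <- df$end_date[df$index == df$index[i] & df$type == "normal"]
  candidates <- candidates[!is.na(candidates) &
                           candidates >= df$effective_date[i]]
  if (length(candidates) > 0) df$calc_end_date[i] <- min(candidates)
}
```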