Have a tricky query I'm trying to run on a data set (~200k rows) with lots of date fields.
A pared-down reprex would be like this:
df <- data.frame(index = c(1, 1, 1, 1, 2, 2, 2, 2, 2, 2, 3, 3),
effective_date = as.Date(c("2020-01-01", "2020-06-13", "2021-01-01", "2022-01-01", "2023-01-01", "2024-01-01", "2024-04-13", "2025-01-01", "2025-03-01", "2026-01-01", "2015-01-01", "2016-01-01")),
end_date = as.Date(c("2021-01-01", NA, "2022-01-01", "2023-01-01", "2024-01-01", "2025-01-01", NA, "2026-01-01", NA, "2027-01-01", "2016-01-01", "2017-01-01")),
type = c("normal", "SPECIAL", "normal", "normal", "normal", "normal", "SPECIAL", "normal", "SPECIAL", "normal", "normal", "normal"))
Logic as follows: the end_date for all records where type = "SPECIAL" is unknown and needs to be calculated. For each record where type = "SPECIAL", retrieve the end_date values of the "normal" records for that particular index value, then pick the earliest end_date from that list which is still >= the effective_date of the "SPECIAL" record.
This would give the following results:
df_2 <- data.frame(index = c(1, 1, 1, 1, 2, 2, 2, 2, 2, 2, 3, 3),
effective_date = as.Date(c("2020-01-01", "2020-06-13", "2021-01-01", "2022-01-01", "2023-01-01", "2024-01-01", "2024-04-13", "2025-01-01", "2025-03-01", "2026-01-01", "2015-01-01", "2016-01-01")),
end_date = as.Date(c("2021-01-01", NA, "2022-01-01", "2023-01-01", "2024-01-01", "2025-01-01", NA, "2026-01-01", NA, "2027-01-01", "2016-01-01", "2017-01-01")),
type = c("normal", "SPECIAL", "normal", "normal", "normal", "normal", "SPECIAL", "normal", "SPECIAL", "normal", "normal", "normal"),
calc_end_date = as.Date(c(NA, "2021-01-01", NA, NA, NA, NA, "2025-01-01", NA, "2026-01-01", NA, NA, NA)))
(constructed this dummy data by hand and I'm tired - I hope I didn't make any mistakes!)
I'm hoping there's a simple way to do this using dplyr-type functions (I like dplyr), but I'm not an expert and I'm struggling to see how to do this.
Thank you.
CodePudding user response:
Using dplyr 1.0.99.9000 (the development version of what became dplyr 1.1.0, which added join_by() and the closest() helper), we can take the SPECIAL rows and join each one to the closest non-SPECIAL row with the same index and an equal or later end_date.
devtools::install_github("tidyverse/dplyr")
left_join(df %>% filter(type == "SPECIAL"),
          df %>% filter(type != "SPECIAL"),
          by = join_by(index, closest(effective_date <= end_date))) %>%
  select(index, effective_date = effective_date.x,
         calc_end_date = end_date.y) -> updates

df %>%
  left_join(updates)
Result
Joining with `by = join_by(index, effective_date)`
index effective_date end_date type calc_end_date
1 1 2020-01-01 2021-01-01 normal <NA>
2 1 2020-06-13 <NA> SPECIAL 2021-01-01
3 1 2021-01-01 2022-01-01 normal <NA>
4 1 2022-01-01 2023-01-01 normal <NA>
5 2 2023-01-01 2024-01-01 normal <NA>
6 2 2024-01-01 2025-01-01 normal <NA>
7 2 2024-04-13 <NA> SPECIAL 2025-01-01
8 2 2025-01-01 2026-01-01 normal <NA>
9 2 2025-03-01 <NA> SPECIAL 2026-01-01
10 2 2026-01-01 2027-01-01 normal <NA>
11 3 2015-01-01 2016-01-01 normal <NA>
12 3 2016-01-01 2017-01-01 normal <NA>
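As a side note, if the end goal is a single filled-in end_date column rather than a separate calc_end_date, the looked-up values can be folded back in afterwards. A minimal base-R sketch with a tiny stand-in for the joined result (with dplyr loaded, mutate(end_date = coalesce(end_date, calc_end_date)) does the same thing):

```r
# Tiny stand-in for the joined result: fill end_date from calc_end_date
# wherever end_date is missing and a calculated value exists.
result <- data.frame(
  end_date      = as.Date(c("2021-01-01", NA, "2022-01-01")),
  calc_end_date = as.Date(c(NA, "2021-01-01", NA))
)
fill <- is.na(result$end_date) & !is.na(result$calc_end_date)
result$end_date[fill] <- result$calc_end_date[fill]
```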
The "updates" table could be created less efficiently with dplyr < 1.1 using a Cartesian join followed by filtering. This is fine for small data, but the intermediate result can blow up if there are many observations per index.
left_join(df %>% filter(type == "SPECIAL"),
          df %>% filter(type != "SPECIAL"),
          by = "index") %>%
  arrange(index, effective_date.x, end_date.y) %>%
  group_by(index, effective_date.x) %>%
  filter(end_date.y >= effective_date.x) %>%
  slice(1) %>%
  select(index, effective_date = effective_date.x, calc_end_date = end_date.y) %>%
  ungroup() -> updates
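And if neither dplyr version is an option, the same lookup can be sketched in base R with a loop over just the SPECIAL rows (fine at ~200k rows as long as the SPECIAL subset isn't huge; uses the reprex data from the question):

```r
# Base-R sketch of the same logic: for each SPECIAL row, find the
# earliest "normal" end_date (same index) that is >= its effective_date.
df <- data.frame(
  index = c(1, 1, 1, 1, 2, 2, 2, 2, 2, 2, 3, 3),
  effective_date = as.Date(c("2020-01-01", "2020-06-13", "2021-01-01",
                             "2022-01-01", "2023-01-01", "2024-01-01",
                             "2024-04-13", "2025-01-01", "2025-03-01",
                             "2026-01-01", "2015-01-01", "2016-01-01")),
  end_date = as.Date(c("2021-01-01", NA, "2022-01-01", "2023-01-01",
                       "2024-01-01", "2025-01-01", NA, "2026-01-01",
                       NA, "2027-01-01", "2016-01-01", "2017-01-01")),
  type = c("normal", "SPECIAL", "normal", "normal", "normal", "normal",
           "SPECIAL", "normal", "SPECIAL", "normal", "normal", "normal")
)

df$calc_end_date <- as.Date(NA)
for (i in which(df$type == "SPECIAL")) {
  # Candidate end dates: "normal" rows sharing this row's index.
  candidates <- df$end_date[df$index == df$index[i] & df$type == "normal"]
  candidates <- candidates[!is.na(candidates) &
                           candidates >= df$effective_date[i]]
  if (length(candidates) > 0) df$calc_end_date[i] <- min(candidates)
}
```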