Last date by group with conditions-CodePudding

This question is a follow-up to this question, but where each idPerson can have multiple decision == "d". There are multiple idPerson, but one suffices to explain the question. idAppt are nested into idPerson. Consider this data frame.

   idPerson idAppt decision date      
 1 A             1 a        2021-09-10
 2 A             1 b        2021-09-11
 3 A             1 c        2021-09-12
 4 A             1 d        2021-09-13
 5 A             2 a        2021-09-20
 6 A             2 b        2021-09-21
 7 A             3 a        2021-09-10
 8 A             3 b        2021-09-11
 9 A             4 a        2021-09-21
10 A             4 b        2021-09-22
11 A             4 c        2021-09-23
12 A             4 d        2021-09-24
13 A             5 a        2021-09-10
14 A             5 b        2021-09-11
15 A             6 a        2021-10-10
16 A             6 b        2021-10-11

I'd like to construct a date2 column which replies to these conditions:

For a given idAppt, if decision == "a" is later than any other date when decision == "d" of that same idPerson, report the latest value of date when decision == "d" for that idPerson (the closest before). For example, in group idAppt == 2, the date of decision == "a" is later than the date of decision == "d" of group idAppt == 1, so date2 should be 2021-09-13. Same applies for group idAppt == 6, but here there are two decision == "d" that are earlier (row 4 and 12). In that case, date2 should be the closest before 2021-10-10, i.e. 2021-09-23.
When there is no decision == "d"'s date earlier than the date of decision == "a" for a given idAppt, take the earliest of the given idPerson.

Which gives the following desired output:

   idPerson idAppt decision date       date2       
 1 A             1 a        2021-09-10 2021-09-10
 2 A             1 b        2021-09-11 2021-09-10
 3 A             1 c        2021-09-12 2021-09-10
 4 A             1 d        2021-09-13 2021-09-10
 5 A             2 a        2021-09-20 2021-09-13 #<- correspond to value of row 4
 6 A             2 b        2021-09-21 2021-09-13  
 7 A             3 a        2021-09-10 2021-09-10 
 8 A             3 b        2021-09-11 2021-09-10
 9 A             4 a        2021-09-21 2021-09-13
10 A             4 b        2021-09-22 2021-09-13
11 A             4 c        2021-09-23 2021-09-13
12 A             4 d        2021-09-24 2021-09-13
13 A             5 a        2021-09-11 2021-09-10 #<- earliest value because 2021-09-10 is earlier than 2021-09-13
14 A             5 b        2021-09-12 2021-09-10
15 A             6 a        2021-10-10 2021-09-24 #<- correspond to value of row 12
16 A             6 b        2021-10-11 2021-09-24

data

df <- structure(list(idPerson = c("A", "A", "A", "A", "A", "A", "A", 
"A", "A", "A", "A", "A", "A", "A", "A", "A"), idAppt = c(1L, 
1L, 1L, 1L, 2L, 2L, 3L, 3L, 4L, 4L, 4L, 4L, 5L, 5L, 6L, 6L), 
    decision = c("a", "b", "c", "d", "a", "b", "a", "b", "a", 
    "b", "c", "d", "a", "b", "a", "b"), date = structure(c(18880, 
    18881, 18882, 18883, 18890, 18891, 18880, 18881, 18891, 18892, 
    18893, 18894, 18881, 18882, 18910, 18911), class = "Date")), class = c("tbl_df", 
"tbl", "data.frame"), row.names = c(NA, -16L))

EO <- structure(list(idPerson = c("A", "A", "A", "A", "A", "A", "A", 
"A", "A", "A", "A", "A", "A", "A", "A", "A"), idAppt = c(1L, 
1L, 1L, 1L, 2L, 2L, 3L, 3L, 4L, 4L, 4L, 4L, 5L, 5L, 6L, 6L), 
    decision = c("a", "b", "c", "d", "a", "b", "a", "b", "a", 
    "b", "c", "d", "a", "b", "a", "b"), date = structure(c(18880, 
    18881, 18882, 18883, 18890, 18891, 18880, 18881, 18891, 18892, 
    18893, 18894, 18881, 18882, 18910, 18911), class = "Date"), 
    date2 = c("2021-09-10", "2021-09-10", "2021-09-10", "2021-09-10", 
    "2021-09-13", "2021-09-13", "2021-09-10", "2021-09-10", "2021-09-13", 
    "2021-09-13", "2021-09-13", "2021-09-13", "2021-09-10", "2021-09-10", 
    "2021-09-24", "2021-09-24")), row.names = c(NA, -16L), class = c("tbl_df", 
"tbl", "data.frame"))

CodePudding user response：

Here's how I solved the problem, although it seems a bit convoluted:

library(dplyr)
df %>%
  group_by(idPerson) %>%
  mutate(d_date = list(date[decision == "d"]), min_date_person = min(date)) %>% 
  group_by(idPerson, idAppt) %>%
  mutate(date3 = unlist(map(d_date, \(x){
          dates <- date[decision == "a"] - x
          w <- which.min(dates[dates > 0])
          ifelse(is.null(w), NA, w)
          })),
         date2 = if_else(is.na(date3), min_date_person, do.call("c", map(d_date, ~ unique(.x[date3]))))) %>% 
  ungroup() %>% 
  select(1:4, date2)

# A tibble: 16 × 5
   idPerson idAppt decision date       date2     
   <chr>     <int> <chr>    <date>     <date>    
 1 A             1 a        2021-09-10 2021-09-10
 2 A             1 b        2021-09-11 2021-09-10
 3 A             1 c        2021-09-12 2021-09-10
 4 A             1 d        2021-09-13 2021-09-10
 5 A             2 a        2021-09-20 2021-09-13
 6 A             2 b        2021-09-21 2021-09-13
 7 A             3 a        2021-09-10 2021-09-10
 8 A             3 b        2021-09-11 2021-09-10
 9 A             4 a        2021-09-21 2021-09-13
10 A             4 b        2021-09-22 2021-09-13
11 A             4 c        2021-09-23 2021-09-13
12 A             4 d        2021-09-24 2021-09-13
13 A             5 a        2021-09-11 2021-09-10
14 A             5 b        2021-09-12 2021-09-10
15 A             6 a        2021-10-10 2021-09-24
16 A             6 b        2021-10-11 2021-09-24

CodePudding user response：

Using a data.table rolling join:

library(data.table)
setDT(df)

# rolling join between decision "d" and "a"
df[decision == "a", date2 := df[decision == "d"][.SD, on = .(idPerson, date), x.date, roll = Inf]]

# set non-matching rows for decision "a" to min(date)
df[decision == "a" & is.na(date2), date2 := min(date), by = idPerson]

# replace other NA by last observation carried forward
setnafill(df, type = "locf", cols = "date2")

    idPerson idAppt decision       date      date2
 1:        A      1        a 2021-09-10 2021-09-10
 2:        A      1        b 2021-09-11 2021-09-10
 3:        A      1        c 2021-09-12 2021-09-10
 4:        A      1        d 2021-09-13 2021-09-10
 5:        A      2        a 2021-09-20 2021-09-13
 6:        A      2        b 2021-09-21 2021-09-13
 7:        A      3        a 2021-09-10 2021-09-10
 8:        A      3        b 2021-09-11 2021-09-10
 9:        A      4        a 2021-09-21 2021-09-13
10:        A      4        b 2021-09-22 2021-09-13
11:        A      4        c 2021-09-23 2021-09-13
12:        A      4        d 2021-09-24 2021-09-13
13:        A      5        a 2021-09-11 2021-09-10
14:        A      5        b 2021-09-12 2021-09-10
15:        A      6        a 2021-10-10 2021-09-24
16:        A      6        b 2021-10-11 2021-09-24

The relevance of 'idAppt' is not entirely clear, because it seems like the comparison of dates is performed within idPerson.