Home > Software engineering >  Finding first non missing date within a group
Finding first non missing date within a group

Time:11-25

I have a long dataframe of survey responses for multiple individuals. It looks like this:

id <- c(1, 1, 1, 1, 2, 2, 2, 2, 3, 3, 3, 3)
status <- c("Missed", "Missed", "Answered", "Answered", "Missed", "Missed", "Missed", "Missed", "Answered", "Answered", "Answered", "Missed")
date <- seq.Date(as.Date("2021-09-01"), as.Date("2021-09-12"), 'day')
reminder.date <- c(as.Date(NA), as.Date(NA), as.Date("2021-09-3"), as.Date("2021-09-04"), as.Date(NA), as.Date(NA), as.Date(NA), as.Date(NA), as.Date("2021-09-09"), as.Date("2021-09-10"), as.Date("2021-09-11"), as.Date(NA))
dat <-data.frame(id, status, date, ReminderDate = as.Date(reminder.date))
dat

       id   status       date ReminderDate
1   1   Missed 2021-09-01         <NA>
2   1   Missed 2021-09-02         <NA>
3   1 Answered 2021-09-03   2021-09-03
4   1 Answered 2021-09-04   2021-09-04
5   2   Missed 2021-09-05         <NA>
6   2   Missed 2021-09-06         <NA>
7   2   Missed 2021-09-07         <NA>
8   2   Missed 2021-09-08         <NA>
9   3 Answered 2021-09-09   2021-09-09
10  3 Answered 2021-09-10   2021-09-10
11  3 Answered 2021-09-11   2021-09-11
12  3   Missed 2021-09-12         <NA>

What I need to do is identify individuals who haven't answered the survey in a certain number of days. To do this I need to find the first date that they survey reminder was sent (ReminderDate) for each individual. I then need the date of the last observation for each individual where the status = "Answered". Eventually I would want the dataset to look something like this:

      id   status       date ReminderDate  FirstDate     LastDate
1   1   Missed 2021-09-01         <NA>     2021-09-03    2021-09-04
2   1   Missed 2021-09-02         <NA>     2021-09-03    2021-09-04
3   1 Answered 2021-09-03   2021-09-03     2021-09-03    2021-09-04
4   1 Answered 2021-09-04   2021-09-04     2021-09-03    2021-09-04
5   2   Missed 2021-09-05         <NA>     <NA>          <NA>
6   2   Missed 2021-09-06         <NA>     <NA>          <NA>
7   2   Missed 2021-09-07         <NA>     <NA>          <NA>
8   2   Missed 2021-09-08         <NA>     <NA>          <NA>
9   3 Answered 2021-09-09   2021-09-09     2021-09-09    2021-09-11
10  3 Answered 2021-09-10   2021-09-10     2021-09-09    2021-09-11
11  3 Answered 2021-09-11   2021-09-11     2021-09-09    2021-09-11
12  3   Missed 2021-09-12         <NA>     2021-09-09    2021-09-11

I can't figure out how to do all of this within each ID, could someone put me on the right track? I'm coming from SAS and I would use a by statement there but I'm not sure how to do this in R.

Thank you!

EDIT: All the replies have been incredibly helpful, but I've found another case that complicates things a bit. One of the individuals looks like this:

  4 Answered 2021-09-09   2021-09-09
  4 Answered 2021-09-10   2021-09-10
  4 Answered 2021-09-11   2021-09-11
  4   Missed 2021-09-12   2021-09-12

Meaning they had a reminder sent out, but they still missed the survey. How do I capture last date this way?

CodePudding user response:

If there are any dates which are not the same as reminder I think this covers it:

Opps I missed accounting for the lack of NAs in the date column in the first version this should be the correct behavior.

2nd update - this fixes the order of the output when your special case is included

dat %>% dplyr::group_by(id) %>%
    dplyr::arrange(id, date) %>%
    dplyr::mutate(
        all_missed = if_else(all(status == "Missed"), TRUE, FALSE),
        date_NA = if_else(status == "Missed", as.Date(NA), date),
        FirstDate = if_else(all_missed, as.Date(NA), suppressWarnings(min(date_NA, na.rm = TRUE))),
        LastDate = if_else(all_missed, as.Date(NA), suppressWarnings(max(date_NA, na.rm = TRUE)))
    ) %>%
    dplyr::select(-all_missed, -date_NA) %>%
    dplyr::ungroup()

CodePudding user response:

With dplyr::group_by() and mutate you could try this...

The fact that there are no dates in ReminderDate for an id where all appointments are missed seems to render the requirement to make the LastDate conditional on status unnecessary.

This script does result in a number of warnings following the calls to mutate.


library(dplyr)


  dat %>% 
  group_by(id) %>% 
  mutate(FirstDate = min(ReminderDate, na.rm = TRUE),
         LastDate = max(ReminderDate, na.rm = TRUE))

#> # A tibble: 12 x 6
#> # Groups:   id [3]
#>       id status   date       ReminderDate FirstDate  LastDate  
#>    <dbl> <chr>    <date>     <date>       <date>     <date>    
#>  1     1 Missed   2021-09-01 NA           2021-09-03 2021-09-04
#>  2     1 Missed   2021-09-02 NA           2021-09-03 2021-09-04
#>  3     1 Answered 2021-09-03 2021-09-03   2021-09-03 2021-09-04
#>  4     1 Answered 2021-09-04 2021-09-04   2021-09-03 2021-09-04
#>  5     2 Missed   2021-09-05 NA           NA         NA        
#>  6     2 Missed   2021-09-06 NA           NA         NA        
#>  7     2 Missed   2021-09-07 NA           NA         NA        
#>  8     2 Missed   2021-09-08 NA           NA         NA        
#>  9     3 Answered 2021-09-09 2021-09-09   2021-09-09 2021-09-11
#> 10     3 Answered 2021-09-10 2021-09-10   2021-09-09 2021-09-11
#> 11     3 Answered 2021-09-11 2021-09-11   2021-09-09 2021-09-11
#> 12     3 Missed   2021-09-12 NA           2021-09-09 2021-09-11

Created on 2021-11-24 by the reprex package (v2.0.1)

CodePudding user response:

just remove all NAs and get the first date

library(dplyr)

dat %>% drop_na(ReminderDate) %>%
  group_by(id) %>% slice_min(date)
#> # A tibble: 2 × 4
#> # Groups:   id [2]
#>      id status   date       ReminderDate
#>   <dbl> <chr>    <date>     <date>      
#> 1     1 Answered 2021-09-03 2021-09-03  
#> 2     3 Answered 2021-09-09 2021-09-09
  •  Tags:  
  • r
  • Related