Home > other >  How can I use the lag function to get the date of the most recent occurrence?
How can I use the lag function to get the date of the most recent occurrence?

Time:08-24

I have a dataframe with columns 'person', 'NoShow', and 'date'.

   Person NoShow       date
1     123      0 2019-01-01
2     123      1 2019-04-01
3     123      0 2020-01-01
4     123      1 2021-01-01
5     123      0 2021-04-30
6     123      0 2022-01-05
7     334      0 2019-07-07
8     334      1 2019-10-11
9     334      1 2020-07-07
10    334      0 2021-01-04

Code to create the dataframe:

df <- data.frame(Person = c('123','123','123','123','123','123','334','334','334','334'), 
                   NoShow = c(0,1,0,1,0,0,0,1,1,0), 
                   date = c('2019-01-01','2019-04-01','2020-01-01','2021-01-01','2021-04-30','2022-01-05','2019-07-07','2019-10-11','2020-07-07','2021-01-04')
                   )

I used the lag function to create two new columns: (1) 'prior_noshow', which indicates whether a person missed any previous appointments; and (2) 'prior_noshow_f', which indicates the number of previous appointments a person missed.

This is the code I used to create these two additional columns:

library(tidyverse)

df %>% 
  group_by(Person) %>%
  mutate(prior_noshow = lag(as.numeric(cumsum(NoShow) > 0)),
         prior_noshow_f = lag(cumsum(NoShow)))

This is the dataframe with the two new columns:

   Person NoShow date       prior_noshow prior_noshow_f
   <chr>   <dbl> <chr>             <dbl>          <dbl>
 1 123         0 2019-01-01           NA             NA
 2 123         1 2019-04-01            0              0
 3 123         0 2020-01-01            1              1
 4 123         1 2021-01-01            1              1
 5 123         0 2021-04-30            1              2
 6 123         0 2022-01-05            1              2
 7 334         0 2019-07-07           NA             NA
 8 334         1 2019-10-11            0              0
 9 334         1 2020-07-07            1              1
10 334         0 2021-01-04            1              2

Now, I'd also like to add a column 'prior_noshow_date', which should show the date of the most recent no-show per person.

So, for example, rows 3 and 4 should have prior_noshow_date==2019-04-01 and rows 5 and 6 should have prior_noshow_date==2021-01-01.

CodePudding user response:

Using data.table, we can filter for each case.

library(data.table)
setDT(df)
df$PrevMiss <- sapply(
  1:dim(df)[1], function(x) df[1:x, ][
    NoShow == 1 & Person == last(Person) & date != last(date) , max(date)])

Output:

> df
    Person NoShow       date   PrevMiss
 1:    123      0 2019-01-01       <NA>
 2:    123      1 2019-04-01       <NA>
 3:    123      0 2020-01-01 2019-04-01
 4:    123      1 2021-01-01 2019-04-01
 5:    123      0 2021-04-30 2021-01-01
 6:    123      0 2022-01-05 2021-01-01
 7:    334      0 2019-07-07       <NA>
 8:    334      1 2019-10-11       <NA>
 9:    334      1 2020-07-07 2019-10-11
10:    334      0 2021-01-04 2020-07-07

CodePudding user response:

Continuing within tidyverse, we can do:

df %>% 
  group_by(Person) %>%
  mutate(prior_noshow = lag(as.numeric(cumsum(NoShow) > 0)),
         prior_noshow_f = lag(cumsum(NoShow)),
         last_noshow = ifelse(c(FALSE, diff(prior_noshow_f))!= 0, date, NA)) %>%
  fill(last_noshow)
#> # A tibble: 10 x 6
#> # Groups:   Person [2]
#>    Person NoShow date       prior_noshow prior_noshow_f last_noshow
#>    <chr>   <dbl> <chr>             <dbl>          <dbl> <chr>      
#>  1 123         0 2019-01-01           NA             NA <NA>       
#>  2 123         1 2019-04-01            0              0 <NA>       
#>  3 123         0 2020-01-01            1              1 2020-01-01 
#>  4 123         1 2021-01-01            1              1 2020-01-01 
#>  5 123         0 2021-04-30            1              2 2021-04-30 
#>  6 123         0 2022-01-05            1              2 2021-04-30 
#>  7 334         0 2019-07-07           NA             NA <NA>       
#>  8 334         1 2019-10-11            0              0 <NA>       
#>  9 334         1 2020-07-07            1              1 2020-07-07 
#> 10 334         0 2021-01-04            1              2 2021-01-04

Created on 2022-08-22 with reprex v2.0.2

  • Related