I have a dataframe with columns 'person', 'NoShow', and 'date'.
Person NoShow date
1 123 0 2019-01-01
2 123 1 2019-04-01
3 123 0 2020-01-01
4 123 1 2021-01-01
5 123 0 2021-04-30
6 123 0 2022-01-05
7 334 0 2019-07-07
8 334 1 2019-10-11
9 334 1 2020-07-07
10 334 0 2021-01-04
Code to create the dataframe:
df <- data.frame(Person = c('123','123','123','123','123','123','334','334','334','334'),
NoShow = c(0,1,0,1,0,0,0,1,1,0),
date = c('2019-01-01','2019-04-01','2020-01-01','2021-01-01','2021-04-30','2022-01-05','2019-07-07','2019-10-11','2020-07-07','2021-01-04')
)
I used the lag function to create two new columns: (1) 'prior_noshow', which indicates whether a person missed any previous appointments; and (2) 'prior_noshow_f', which indicates the number of previous appointments a person missed.
This is the code I used to create these two additional columns:
library(tidyverse)
df %>%
group_by(Person) %>%
mutate(prior_noshow = lag(as.numeric(cumsum(NoShow) > 0)),
prior_noshow_f = lag(cumsum(NoShow)))
This is the dataframe with the two new columns:
Person NoShow date prior_noshow prior_noshow_f
<chr> <dbl> <chr> <dbl> <dbl>
1 123 0 2019-01-01 NA NA
2 123 1 2019-04-01 0 0
3 123 0 2020-01-01 1 1
4 123 1 2021-01-01 1 1
5 123 0 2021-04-30 1 2
6 123 0 2022-01-05 1 2
7 334 0 2019-07-07 NA NA
8 334 1 2019-10-11 0 0
9 334 1 2020-07-07 1 1
10 334 0 2021-01-04 1 2
Now, I'd also like to add a column 'prior_noshow_date', which should show the date of the most recent no-show per person.
So, for example, rows 3 and 4 should have prior_noshow_date==2019-04-01 and rows 5 and 6 should have prior_noshow_date==2021-01-01.
CodePudding user response:
Using data.table
, we can filter for each case.
library(data.table)
setDT(df)
df$PrevMiss <- sapply(
1:dim(df)[1], function(x) df[1:x, ][
NoShow == 1 & Person == last(Person) & date != last(date) , max(date)])
Output:
> df
Person NoShow date PrevMiss
1: 123 0 2019-01-01 <NA>
2: 123 1 2019-04-01 <NA>
3: 123 0 2020-01-01 2019-04-01
4: 123 1 2021-01-01 2019-04-01
5: 123 0 2021-04-30 2021-01-01
6: 123 0 2022-01-05 2021-01-01
7: 334 0 2019-07-07 <NA>
8: 334 1 2019-10-11 <NA>
9: 334 1 2020-07-07 2019-10-11
10: 334 0 2021-01-04 2020-07-07
CodePudding user response:
Continuing within tidyverse, we can do:
df %>%
group_by(Person) %>%
mutate(prior_noshow = lag(as.numeric(cumsum(NoShow) > 0)),
prior_noshow_f = lag(cumsum(NoShow)),
last_noshow = ifelse(c(FALSE, diff(prior_noshow_f))!= 0, date, NA)) %>%
fill(last_noshow)
#> # A tibble: 10 x 6
#> # Groups: Person [2]
#> Person NoShow date prior_noshow prior_noshow_f last_noshow
#> <chr> <dbl> <chr> <dbl> <dbl> <chr>
#> 1 123 0 2019-01-01 NA NA <NA>
#> 2 123 1 2019-04-01 0 0 <NA>
#> 3 123 0 2020-01-01 1 1 2020-01-01
#> 4 123 1 2021-01-01 1 1 2020-01-01
#> 5 123 0 2021-04-30 1 2 2021-04-30
#> 6 123 0 2022-01-05 1 2 2021-04-30
#> 7 334 0 2019-07-07 NA NA <NA>
#> 8 334 1 2019-10-11 0 0 <NA>
#> 9 334 1 2020-07-07 1 1 2020-07-07
#> 10 334 0 2021-01-04 1 2 2021-01-04
Created on 2022-08-22 with reprex v2.0.2