Iterative filtering in R

I have a dataset of patient visits at a clinic. Each individual patient can visit on multiple occasions. Each patient is identified by a study_id and each visit by an illness_id. I want to iteratively filter the dataframe so that a visit that occurs within 28 days of a previous visit is removed.

I cannot simply calculate the interval between all visits and then remove those which occur within 28 days. The intervals need to be calculated iteratively as the dataframe is filtered.

In the example below you can see patient 0003 presented three times. Visit 1 is always retained. Visit 2 should be removed as it occurred 7 days after Visit 1. Once Visit 2 is removed, Visit 3 is compared with Visit 1 instead; it occurred 29 days after Visit 1 and so should be retained. However, if I calculate all the intervals first and then filter out any visit with an interval of 28 days or less, both Visits 2 and 3 would be removed (because Visit 2 occurred 7 days after Visit 1 and Visit 3 occurred 22 days after Visit 2).
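To make the difference concrete, here is a small base R sketch using only patient 0003's three dates from the example. The variable names (`visits`, `gap_to_previous`, `naive_keep`) are illustrative and not part of any package.

```r
# Patient 0003's visits from the example
visits <- as.Date(c("2008-09-29", "2008-10-06", "2008-10-28"))

# Naive approach: gap to the immediately preceding visit, computed once
gap_to_previous <- c(NA, diff(as.numeric(visits)))
naive_keep <- is.na(gap_to_previous) | gap_to_previous > 28

# Only the first visit survives the naive filter
visits[naive_keep]

# Desired behaviour: once visit 2 is dropped, visit 3 is compared with visit 1
as.numeric(visits[3] - visits[1])
```

The last line gives the gap that matters once Visit 2 has been removed, which is why the intervals cannot all be computed up front.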

study_id	illness_id	illness_date
0001	000103/12/2007	2007/12/03
0002	000224/03/2008	2008/03/24
0002	000226/04/2008	2008/04/26
0002	000217/07/2008	2008/07/17
0002	000221/08/2008	2008/08/21
0002	000225/08/2008	2008/08/25
0003	000329/09/2008	2008/09/29
0003	000306/10/2008	2008/10/06
0003	000328/10/2008	2008/10/28

The correctly filtered dataframe should be:

study_id	illness_id	illness_date
0001	000103/12/2007	2007/12/03
0002	000224/03/2008	2008/03/24
0002	000226/04/2008	2008/04/26
0002	000217/07/2008	2008/07/17
0002	000221/08/2008	2008/08/21
0003	000329/09/2008	2008/09/29
0003	000328/10/2008	2008/10/28

Thanks for any help - I am new to R and am struggling to get my head around iteration and loops. If there is a simple solution involving dplyr filter that would be great.

CodePudding user response：

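This compares each visit with the patient's first visit. It gives the expected result for patient 0003, but it never resets the reference date, so for patient 0002 the 2008/08/25 visit (4 days after the retained 2008/08/21 visit) is still kept:

library(dplyr)

df %>% 
  mutate(illness_date = as.Date(illness_date, 
                                format = "%Y/%m/%d")) %>% 
  group_by(study_id) %>% 
  # days elapsed since each patient's first visit
  mutate(time_since_first_visit = illness_date - min(illness_date)) %>% 
  # keep the first visit and anything more than 28 days after it
  filter(time_since_first_visit == 0 | time_since_first_visit > 28)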
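A quick way to check a rule like this is to apply it to patient 0002 by hand. The sketch below uses base R only, with hypothetical variable names, and suggests that measuring from the first visit keeps the 2008/08/25 visit, which the expected output in the question drops.

```r
# Patient 0002's visits from the question
visits <- as.Date(c("2008-03-24", "2008-04-26", "2008-07-17",
                    "2008-08-21", "2008-08-25"))

# Same rule as the filter() call: days since the patient's first visit
days_since_first <- as.numeric(visits - min(visits))
kept <- visits[days_since_first == 0 | days_since_first > 28]

# The last visit is close to the one before it, yet it is still kept
as.numeric(visits[5] - visits[4])
"2008-08-25" %in% format(kept)
```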

CodePudding user response：

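Here is a function that returns a logical vector flagging the rows to drop, and an example of calling it by group using data.table. The dates must be sorted within each group.

fFilter <- function(v, gap) {
  # TRUE marks a row to drop; the first row is never dropped
  blnDrop <- logical(length(v))
  if (length(v) > 1L) {
    # prev holds the date of the most recent retained row
    prev <- v[1]
    
    for (i in 2:length(v)) {
      # within the gap: drop; otherwise this row becomes the new reference
      if (v[i] - prev <= gap) blnDrop[i] <- TRUE else prev <- v[i]
    }
  }
  
  blnDrop
}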

library(data.table)

dt <- data.table(id = rep(1:3, c(1, 5, 3)), date = as.Date(c("2007/12/3", "2008/3/24", "2008/4/26", "2008/7/17", "2008/8/21", "2008/8/25", "2008/9/29", "2008/10/6", "2008/10/28")))
setorder(dt, id, date)
dt[,drop := fFilter(date, 28), by = "id"][drop == FALSE, 1:(length(dt) - 1L)]
#>    id       date
#> 1:  1 2007-12-03
#> 2:  2 2008-03-24
#> 3:  2 2008-04-26
#> 4:  2 2008-07-17
#> 5:  2 2008-08-21
#> 6:  3 2008-09-29
#> 7:  3 2008-10-28
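For readers without data.table, the same loop can be applied per patient in base R. The following is a minimal sketch, assuming the question's column names `study_id` and `illness_date`; `drop_within_gap` is a hypothetical helper that mirrors the idea of `fFilter`, and the dates are passed as day counts so the comparison is plain arithmetic.

```r
# Hypothetical helper, same idea as fFilter: TRUE marks a visit to drop
drop_within_gap <- function(dates, gap) {
  drop <- logical(length(dates))
  last_kept <- dates[1]
  for (i in seq_along(dates)[-1]) {
    if (dates[i] - last_kept <= gap) drop[i] <- TRUE else last_kept <- dates[i]
  }
  drop
}

# The question's data, without the illness_id column
df <- data.frame(
  study_id = rep(c("0001", "0002", "0003"), c(1, 5, 3)),
  illness_date = as.Date(c("2007-12-03", "2008-03-24", "2008-04-26",
                           "2008-07-17", "2008-08-21", "2008-08-25",
                           "2008-09-29", "2008-10-06", "2008-10-28"))
)
# The helper relies on dates being sorted within each patient
df <- df[order(df$study_id, df$illness_date), ]

# ave() applies the helper within each study_id and returns one 0/1 flag per row
flag <- ave(as.numeric(df$illness_date), df$study_id,
            FUN = function(d) drop_within_gap(d, 28))
filtered <- df[flag == 0, ]
```

With dplyr, a helper like this should also slot into `group_by(study_id) |> filter(!drop_within_gap(illness_date, 28))`, provided the rows are sorted by date within each patient first.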

CodePudding user response：

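Here I use purrr::accumulate to carry forward the date of the most recent retained visit: a date replaces the carried value only if it is more than 28 days later, otherwise the carried value is kept.

library(dplyr)
library(purrr)

# assumes illness_date is already a Date column
df |>
  group_by(study_id) |>
  arrange(illness_date, .by_group = TRUE) |>
  # h = date (as a day count) of the most recent retained visit
  mutate(h = accumulate(as.numeric(illness_date),
                        ~ if (.y - .x > 28) .y else .x)) |>
  # a visit is retained where h moves forward; the first row compares against 0
  filter(h - lag(h, 1, 0) > 28) |>
  select(-h)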

# A tibble: 7 × 3
# Groups:   study_id [3]
  illness_id     illness_date study_id
  <chr>          <date>       <chr>   
1 000103/12/2007 2007-12-03   0001    
2 000224/03/2008 2008-03-24   0002    
3 000226/04/2008 2008-04-26   0002    
4 000217/07/2008 2008-07-17   0002    
5 000221/08/2008 2008-08-21   0002    
6 000329/09/2008 2008-09-29   0003    
7 000328/10/2008 2008-10-28   0003
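The carry-forward step can be looked at in isolation. As a rough sketch, base R's `Reduce(..., accumulate = TRUE)` plays the same role as `accumulate` here; the example uses patient 0002's dates from the question, and the names `days`, `last_kept` and `retained` are illustrative only.

```r
# Patient 0002's visits as day counts since 1970-01-01
days <- as.numeric(as.Date(c("2008-03-24", "2008-04-26", "2008-07-17",
                             "2008-08-21", "2008-08-25")))

# Carry forward the last retained visit; replace it only after a gap > 28 days
last_kept <- Reduce(function(prev, cur) if (cur - prev > 28) cur else prev,
                    days, accumulate = TRUE)

# A visit is retained exactly where the carried-forward value changes
retained <- c(TRUE, diff(last_kept) > 0)
as.Date(days[retained], origin = "1970-01-01")
```

Because the carried value only ever changes by more than 28 days, testing for any change is equivalent here to testing for a change greater than 28.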