I have a dataset of patient visits at a clinic. Each individual patient can visit on multiple occasions. Each patient is identified by a study_id and each visit by an illness_id. I want to iteratively filter the dataframe so that a visit that occurs within 28 days of a previous visit is removed.
I cannot simply calculate the interval between all visits and then remove those which occur within 28 days. The intervals need to be calculated iteratively as the dataframe is filtered.
In the example below you can see patient 0003 presented three times. Visit 1 is always retained. Visit 2 should be removed as it occurred 7 days after Visit 1. Once Visit 2 is removed, Visit 3 occurs 29 days after Visit 1 and so should be retained. However, if I calculate all the intervals first and then filter out any visit with an interval of 28 days or less, both Visits 2 and 3 would be removed (Visit 2 occurred 7 days after Visit 1, and Visit 3 occurred 22 days after Visit 2).
study_id | illness_id | illness_date |
---|---|---|
0001 | 000103/12/2007 | 2007/12/03 |
0002 | 000224/03/2008 | 2008/03/24 |
0002 | 000226/04/2008 | 2008/04/26 |
0002 | 000217/07/2008 | 2008/07/17 |
0002 | 000221/08/2008 | 2008/08/21 |
0002 | 000225/08/2008 | 2008/08/25 |
0003 | 000329/09/2008 | 2008/09/29 |
0003 | 000306/10/2008 | 2008/10/06 |
0003 | 000328/10/2008 | 2008/10/28 |
The correctly filtered dataframe should be:
study_id | illness_id | illness_date |
---|---|---|
0001 | 000103/12/2007 | 2007/12/03 |
0002 | 000224/03/2008 | 2008/03/24 |
0002 | 000226/04/2008 | 2008/04/26 |
0002 | 000217/07/2008 | 2008/07/17 |
0002 | 000221/08/2008 | 2008/08/21 |
0003 | 000329/09/2008 | 2008/09/29 |
0003 | 000328/10/2008 | 2008/10/28 |
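For anyone who wants to reproduce this, the input table above could be built roughly as follows. This is only a sketch: the name `df` and the column types are assumptions, with `illness_date` kept as text in the format shown.

```r
# Example data from the table above; all columns stored as character (an assumption).
df <- data.frame(
  study_id = c("0001", rep("0002", 5), rep("0003", 3)),
  illness_id = c(
    "000103/12/2007",
    "000224/03/2008", "000226/04/2008", "000217/07/2008",
    "000221/08/2008", "000225/08/2008",
    "000329/09/2008", "000306/10/2008", "000328/10/2008"
  ),
  illness_date = c(
    "2007/12/03",
    "2008/03/24", "2008/04/26", "2008/07/17", "2008/08/21", "2008/08/25",
    "2008/09/29", "2008/10/06", "2008/10/28"
  ),
  stringsAsFactors = FALSE
)
```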
Thanks for any help - I am new to R and am struggling to get my head around iteration and loops. If there is a simple solution involving dplyr filter that would be great.
CodePudding user response:
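This compares each visit with the patient's first visit only. It gives the right result for patient 0003, but it is not fully iterative: for patient 0002 it keeps the 2008/08/25 visit, which falls 4 days after the retained 2008/08/21 visit, because both are more than 28 days after 2008/03/24.
library(dplyr)
df %>%
  mutate(illness_date = as.Date(illness_date,
                                format = "%Y/%m/%d")) %>%
  group_by(study_id) %>%
  # days elapsed since each patient's earliest visit
  mutate(time_since_first_visit = illness_date - min(illness_date)) %>%
  # keep the first visit and anything more than 28 days after it
  filter(time_since_first_visit == 0 | time_since_first_visit > 28)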
CodePudding user response:
Here is a function that returns the rows to drop, and an example of calling it by group using data.table.
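# v: visit dates for one patient, sorted ascending; gap: minimum spacing in days
fFilter <- function(v, gap) {
  blnDrop <- logical(length(v))
  if (length(v) > 1L) {
    prev <- v[1]  # most recent retained visit
    for (i in 2:length(v)) {
      # drop a visit within `gap` days of the last retained visit,
      # otherwise retain it and make it the new reference point
      if (v[i] - prev <= gap) blnDrop[i] <- TRUE else prev <- v[i]
    }
  }
  blnDrop
}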
library(data.table)
dt <- data.table(id = rep(1:3, c(1, 5, 3)), date = as.Date(c("2007/12/3", "2008/3/24", "2008/4/26", "2008/7/17", "2008/8/21", "2008/8/25", "2008/9/29", "2008/10/6", "2008/10/28")))
setorder(dt, id, date)
dt[,drop := fFilter(date, 28), by = "id"][drop == FALSE, 1:(length(dt) - 1L)]
#> id date
#> 1: 1 2007-12-03
#> 2: 2 2008-03-24
#> 3: 2 2008-04-26
#> 4: 2 2008-07-17
#> 5: 2 2008-08-21
#> 6: 3 2008-09-29
#> 7: 3 2008-10-28
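Since the question asks about a data frame rather than a data.table, here is a minimal base R sketch of the same idea, applying the helper per patient with `ave()`. The data frame `df` and its column names are assumptions taken from the question's table (with `illness_id` left out for brevity), and `fFilter` is repeated so the block runs on its own.

```r
# Helper from the answer above, repeated so this block is self-contained.
fFilter <- function(v, gap) {
  blnDrop <- logical(length(v))
  if (length(v) > 1L) {
    prev <- v[1]
    for (i in 2:length(v)) {
      if (v[i] - prev <= gap) blnDrop[i] <- TRUE else prev <- v[i]
    }
  }
  blnDrop
}

# Sample data from the question (illness_id omitted for brevity).
df <- data.frame(
  study_id = rep(c("0001", "0002", "0003"), c(1, 5, 3)),
  illness_date = as.Date(c(
    "2007-12-03", "2008-03-24", "2008-04-26", "2008-07-17", "2008-08-21",
    "2008-08-25", "2008-09-29", "2008-10-06", "2008-10-28"
  ))
)
# The helper expects dates in ascending order within each patient.
df <- df[order(df$study_id, df$illness_date), ]

# ave() runs the helper within each study_id. Dates are passed as day
# counts, and the logical drop flags come back as 0/1.
drop <- ave(as.numeric(df$illness_date), df$study_id,
            FUN = function(v) fFilter(v, 28))
kept <- df[drop == 0, ]
kept
```

With dplyr, the same helper should also slot into `group_by(study_id) |> filter(!fFilter(illness_date, 28))`, provided `illness_date` is a Date column sorted within each patient.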
CodePudding user response:
Here I use purrr::accumulate to carry forward the last retained date: a visit's date is propagated only if it is more than 28 days after the last retained date; otherwise the last retained date is kept.
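library(dplyr)
library(purrr)
# Assumes illness_date is already a Date column.
df |>
  group_by(study_id) |>
  arrange(illness_date, .by_group = TRUE) |>
  # h = last retained date as a day count (ifelse() would strip the Date class)
  mutate(h = accumulate(as.numeric(illness_date), ~ if (.y - .x > 28) .y else .x)) |>
  # keep a row only where h moves forward; the first row is compared with 0
  filter(h - lag(h, default = 0) > 28) |>
  select(-h)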
# A tibble: 7 × 3
# Groups: study_id [3]
illness_id illness_date study_id
<chr> <date> <chr>
1 000103/12/2007 2007-12-03 0001
2 000224/03/2008 2008-03-24 0002
3 000226/04/2008 2008-04-26 0002
4 000217/07/2008 2008-07-17 0002
5 000221/08/2008 2008-08-21 0002
6 000329/09/2008 2008-09-29 0003
7 000328/10/2008 2008-10-28 0003
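To see what the accumulate step is doing, here is a small base R trace for patient 0003 using `Reduce(..., accumulate = TRUE)`, which plays the same role as `purrr::accumulate`. It is only an illustrative sketch; the names `days`, `h` and `keep` are made up for this example.

```r
# Visit dates for patient 0003 from the question, converted to day counts.
dates <- as.Date(c("2008-09-29", "2008-10-06", "2008-10-28"))
days <- as.numeric(dates)

# Carry forward the last retained visit: move on to the current visit only
# when it is more than 28 days after that retained visit.
h <- Reduce(function(prev, cur) if (cur - prev > 28) cur else prev,
            days, accumulate = TRUE)

# A visit is retained when it is the first one or when h changes.
keep <- c(TRUE, diff(h) != 0)

data.frame(date = dates,
           last_retained = as.Date(h, origin = "1970-01-01"),
           keep = keep)
```

The second visit inherits 2008-09-29 as its last retained date and is dropped, while the third visit is 29 days after 2008-09-29 and so becomes the new retained date.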