I have a dataset with over 1000 unique IDs and for each ID about 15 Surgery Codes done on different Dates(recorded as Days Diff)
I want to take only 1 record per 30 days within the group of each surgery code for each ID.
Adding a demo data here:
ID Age Diag.Date Surgery.Code Days.diff
1 1 67 4/8/2011 A364 421
2 1 67 4/8/2011 A364 1197
3 1 67 4/8/2011 A364 2207
4 1 67 4/8/2011 A364 2226
5 1 67 4/8/2011 A364 2247
6 1 67 4/8/2011 A364 2254
7 1 67 4/8/2011 A364 2331
8 1 67 4/8/2011 A364 2367
9 1 67 4/8/2011 A364 2905
10 1 67 4/8/2011 A364 2918
11 1 67 4/8/2011 D365 2200
12 1 67 4/8/2011 D441 308
13 1 67 4/8/2011 D443 218
14 1 67 4/8/2011 A446 308
15 2 56 6/4/2018 A453 2260
16 2 56 6/4/2018 D453 645
17 2 56 6/4/2018 D453 3095
18 2 56 6/4/2018 B453 645
Diff of 2226-2207 days is 19 days so row4 will delete, again diff of 2247-2207 days is 40 days so row5 will get recorded. Again diff of 2254-2247 days is 7 days so row6 will get deleted. Similarly, row10 will get deleted.
Any help is appreciated!
CodePudding user response:
- Use
dplyr::group_by(ID, Surgery.Code)
to filter within individuals and surgeries; - Within each group, use
Days.diff - dplyr::lag(Days.diff) <= 30
to test for adjacent rows within 30 days; - Because the results of (2) may change when rows are removed, you'll want to iterate by removing one row at a time per group, then re-testing. You can use
while
to iterate until no more cases are detected.
library(dplyr)
filtered <- surgeries %>%
group_by(ID, Surgery.Code) %>%
mutate(within30 = if_else(
Days.diff - lag(Days.diff) <= 30,
row_number(),
NA_integer_
))
while (any(!is.na(filtered$within30))) {
filtered <- filtered %>%
mutate(within30 = if_else(
Days.diff - lag(Days.diff) <= 30,
row_number(),
NA_integer_
)) %>%
filter(is.na(within30) | within30 != min(within30, na.rm = TRUE))
}
filtered %>%
select(!within30) %>%
ungroup()
#> # A tibble: 15 x 5
#> ID Age Diag.Date Surgery.Code Days.diff
#> <int> <int> <chr> <chr> <int>
#> 1 1 67 4/8/2011 A364 421
#> 2 1 67 4/8/2011 A364 1197
#> 3 1 67 4/8/2011 A364 2207
#> 4 1 67 4/8/2011 A364 2247
#> 5 1 67 4/8/2011 A364 2331
#> 6 1 67 4/8/2011 A364 2367
#> 7 1 67 4/8/2011 A364 2905
#> 8 1 67 4/8/2011 D365 2200
#> 9 1 67 4/8/2011 D441 308
#> 10 1 67 4/8/2011 D443 218
#> 11 1 67 4/8/2011 A446 308
#> 12 2 56 6/4/2018 A453 2260
#> 13 2 56 6/4/2018 D453 645
#> 14 2 56 6/4/2018 D453 3095
#> 15 2 56 6/4/2018 B453 645
Created on 2022-03-01 by the reprex package (v2.0.1)