I want to group within a group by date difference.
For example, if there are 7 cases in facility A, but the first 5 cases happened before 14 days of the last 2 cases, I want them to be in two different groups (see below example)
location | address | start_date | start_date_diff | Group |
---|---|---|---|---|
Facility A | 123 main st | 2/7/2022 | 0 | 1 |
Facility A | 123 main st | 2/11/2022 | 4 | 1 |
Facility A | 123 main st | 2/11/2022 | 0 | 1 |
Facility A | 123 main st | 2/11/2022 | 0 | 1 |
Facility A | 123 main st | 2/12/2022 | 1 | 1 |
Facility A | 123 main st | 3/12/2022 | 28 | 2 |
Facility A | 123 main st | 3/17/2022 | 5 | 2 |
Facility B | 55 ford rd | 3/16/2022 | 0 | 3 |
Facility B | 55 ford rd | 3/16/2022 | 0 | 3 |
Facility C | 1 step ave | 3/16/2022 | 0 | 4 |
Facility C | 1 step ave | 3/20/2022 | 4 | 4 |
Facility C | 1 step ave | 3/22/2022 | 2 | 4 |
here is my code so far:
I am stuck on how to group them further by the date difference between individual observations.
CodePudding user response:
Assuming we don't already diff
calculated, and that we need to convert start_date
into something arithmetically useful.
data.table
library(data.table)
as.data.table(dat)[, start_date := as.Date(start_date, format = "%m/%d/%Y")
][, diff14 := cumsum(c(0, diff(start_date)) > 14), by = location
][, Group2 := rleid(location, diff14)][]
# location address start_date start_date_diff Group diff14 Group2
# <char> <char> <Date> <int> <int> <int> <int>
# 1: Facility A 123 main st 2022-02-07 0 1 0 1
# 2: Facility A 123 main st 2022-02-11 4 1 0 1
# 3: Facility A 123 main st 2022-02-11 0 1 0 1
# 4: Facility A 123 main st 2022-02-11 0 1 0 1
# 5: Facility A 123 main st 2022-02-12 1 1 0 1
# 6: Facility A 123 main st 2022-03-12 28 2 1 2
# 7: Facility A 123 main st 2022-03-17 5 2 1 2
# 8: Facility B 55 ford rd 2022-03-16 0 3 0 3
# 9: Facility B 55 ford rd 2022-03-16 0 3 0 3
# 10: Facility C 1 step ave 2022-03-16 0 4 0 4
# 11: Facility C 1 step ave 2022-03-20 4 4 0 4
# 12: Facility C 1 step ave 2022-03-22 2 4 0 4
dplyr
library(dplyr)
dat %>%
mutate(start_date = as.Date(start_date, format = "%m/%d/%Y")) %>%
group_by(location) %>%
mutate(diff14 = cumsum(c(0, diff(start_date)) > 14)) %>%
group_by(location, diff14) %>%
mutate(Group2 = cur_group_id()) %>%
ungroup()
# # A tibble: 12 x 7
# location address start_date start_date_diff Group diff14 Group2
# <chr> <chr> <date> <int> <int> <int> <int>
# 1 Facility A 123 main st 2022-02-07 0 1 0 1
# 2 Facility A 123 main st 2022-02-11 4 1 0 1
# 3 Facility A 123 main st 2022-02-11 0 1 0 1
# 4 Facility A 123 main st 2022-02-11 0 1 0 1
# 5 Facility A 123 main st 2022-02-12 1 1 0 1
# 6 Facility A 123 main st 2022-03-12 28 2 1 2
# 7 Facility A 123 main st 2022-03-17 5 2 1 2
# 8 Facility B 55 ford rd 2022-03-16 0 3 0 3
# 9 Facility B 55 ford rd 2022-03-16 0 3 0 3
# 10 Facility C 1 step ave 2022-03-16 0 4 0 4
# 11 Facility C 1 step ave 2022-03-20 4 4 0 4
# 12 Facility C 1 step ave 2022-03-22 2 4 0 4