I am trying to calculate length of stay for clients; this is not a field in the data, but each row is a separate day, with consecutive days belonging to the same "stay" - so if there are 5 rows for the same client all with consecutive days, their length of stay would be 5 days. However, there are a lot of different clients, and each client could have more than one stay - if the dates of service are not consecutive, they would count as separate stays. For example:
client id | date of service |
---|---|
123 | 1/1/22 |
123 | 1/2/22 |
123 | 1/3/22 |
123 | 1/4/22 |
123 | 1/5/22 |
456 | 1/1/22 |
456 | 1/4/22 |
456 | 1/5/22 |
456 | 1/6/22 |
789 | 1/1/22 |
789 | 1/2/22 |
789 | 1/5/22 |
789 | 1/6/22 |
789 | 1/9/22 |
789 | 1/10/22 |
For this fake data, client '123' would have 1 stay with a length of stay of 5 days; client '456' would have 2 stays, one with a length of stay of 1 day and the other with a length of stay of 3 days; and client '789' would have 3 stays, each with a length of stay of 2 days.
My plan was to group the data by client id, and then order by date of service, and then maybe use some kind of for loop to check if the date of service is 1 day later and if it is, add it to the length of stay and if it isn't, start a new "stay"; but this seems clunky and inefficient, and I'm not entirely sure it would work correctly. There are thousands of rows in the data and hundreds of clients.
Ideally, my output would provide the client id and their calculated length of stay(s); if possible, it would be excellent to also have the output provide the starting date for each stay as well.
Is there a way to accomplish this in R?
I appreciate any help you can provide. Thank you!
CodePudding user response:
We may use
library(dplyr)
library(lubridate)
df1 %>%
mutate(dateofservice = mdy(dateofservice)) %>%
group_by(clientid) %>%
mutate(grp = cumsum(c(TRUE, diff(dateofservice) != 1))) %>%
group_by(grp, .add = TRUE) %>%
summarise(len = n(), start_date = first(dateofservice), .groups = 'drop')
-output
# A tibble: 6 × 4
clientid grp len start_date
<int> <int> <int> <date>
1 123 1 5 2022-01-01
2 456 1 1 2022-01-01
3 456 2 3 2022-01-04
4 789 1 2 2022-01-01
5 789 2 2 2022-01-05
6 789 3 2 2022-01-09
data
df1 <- structure(list(clientid = c(123L, 123L, 123L, 123L, 123L,
456L,
456L, 456L, 456L, 789L, 789L, 789L, 789L, 789L, 789L),
dateofservice = c("1/1/22",
"1/2/22", "1/3/22", "1/4/22", "1/5/22", "1/1/22", "1/4/22", "1/5/22",
"1/6/22", "1/1/22", "1/2/22", "1/5/22", "1/6/22", "1/9/22", "1/10/22"
)), class = "data.frame", row.names = c(NA, -15L))