I have a long dataset (25k rows) where each row represents a dose of a certain med
ication given to a person (id
) on a given day (adm_date
).
I'm now trying to summarise this dataset so that each row represents a course of a med
for a given id
, with columns indicating start date and end date. Each course can last any number of days.
library(dplyr)
df <- tribble(~id, ~med, ~adm_date, ~start, ~end,
1, "antibiotic", "13/01/2022", TRUE, NA,
1, "antibiotic", "14/01/2022", NA, TRUE,
1, "antibiotic", "21/01/2022", TRUE, NA,
1, "antibiotic", "22/02/2022", NA, NA,
1, "antibiotic", "23/02/2022", NA, TRUE,
1, "aspirin", "14/01/2022", TRUE, TRUE,
2, "antibiotic", "14/01/2022", TRUE, NA,
2, "antibiotic", "15/01/2022", NA, TRUE,
2, "aspirin", "14/01/2022", TRUE, TRUE)
Intended output:
id | med | course | course_start | course_end |
---|---|---|---|---|
1 | antibiotic | 1 | 13/01/2022 | 14/01/2022 |
1 | antibiotic | 2 | 21/01/2022 | 23/02/2022 |
1 | aspirin | 1 | 14/01/2022 | 14/01/2022 |
2 | antibiotic | 1 | 14/01/2022 | 15/01/2022 |
2 | aspirin | 1 | 14/01/2022 | 14/01/2022 |
I can't simply use group_by(id, med) %>% summarise(course_start = min(adm_date), course_end = max(adm_date)
because a medication can be prescribed in more than one course (e.g. two instances for the same antibiotic: 13-14 Jan, 21-22 Jan; the code above would merge the two into 13-22 Jan).
My approach is:
- I coded columns
start
andend
to identify the start and end of each course (Done - see sample dataset). - Using these to create a
course
(instance) column. (Don't know how to do this). df |> group_by(id, med, course) |> summarise(course_start = min(adm_date), course_end = max(adm_date)
(This should work).
(Of course, happy with any other approach).
(PS: this is similar to this question, however I don't have another factor that separates instances, so I can't use Stefano's brilliant solution)
CodePudding user response:
df %>%
filter(!(is.na(start) & is.na(end))) %>%
group_by(med, id) %>%
mutate(course_start = if_else(start == T, adm_date, NA_character_),
course_end = if_else(is.na(end), lead(adm_date), course_start)) %>%
drop_na(course_start) %>%
select(id, med, starts_with("course")) %>%
mutate(course = row_number(), .after = med) %>%
ungroup
# A tibble: 5 x 5
id med course course_start course_end
<dbl> <chr> <int> <chr> <chr>
1 1 antibiotic 1 13/01/2022 14/01/2022
2 1 antibiotic 2 21/01/2022 23/02/2022
3 1 aspirin 1 14/01/2022 14/01/2022
4 2 antibiotic 1 14/01/2022 15/01/2022
5 2 aspirin 1 14/01/2022 14/01/2022
CodePudding user response:
- First
group_by(id, med)
so that they're treated as a group - Create a
course_end
column byifelse
. Ifend
isNA
, set the next row ofadm_date
as thecourse_end
. If it's notNA
, use theadm_date
. - Rename
adm_date
tostart_date
- Remove rows that contain
NA
in thestart
column - Remove
start
andend
columns - Create a
course
column by forcingmed
into numeric column and calculate thecumsum
of it.
library(tidyverse)
df %>%
group_by(id, med) %>%
mutate(course_end = ifelse(is.na(end), lead(adm_date), adm_date)) %>%
rename("start_date" = "adm_date") %>%
drop_na(start) %>%
select(-start, -end) %>%
mutate(course = cumsum(as.numeric(as.factor(med))))
# A tibble: 5 × 5
# Groups: id, med [4]
id med start_date course_end course
<dbl> <chr> <chr> <chr> <dbl>
1 1 antibiotic 13/01/2022 14/01/2022 1
2 1 antibiotic 21/01/2022 22/02/2022 2
3 1 aspirin 14/01/2022 14/01/2022 1
4 2 antibiotic 14/01/2022 15/01/2022 1
5 2 aspirin 14/01/2022 14/01/2022 1