Summarise medication dates based on instance of prescribing-CodePudding

I have a long dataset (25k rows) where each row represents a dose of a certain medication given to a person (id) on a given day (adm_date).

I'm now trying to summarise this dataset so that each row represents a course of a med for a given id, with columns indicating start date and end date. Each course can last any number of days.

    library(dplyr)
    df <- tribble(~id, ~med,  ~adm_date,   ~start, ~end,
            1,  "antibiotic", "13/01/2022", TRUE, NA,
            1,  "antibiotic", "14/01/2022", NA,   TRUE,
            1,  "antibiotic", "21/01/2022", TRUE, NA,
            1,  "antibiotic", "22/02/2022", NA,   NA,
            1,  "antibiotic", "23/02/2022", NA,   TRUE,
            1,  "aspirin",    "14/01/2022", TRUE, TRUE,
            2,  "antibiotic", "14/01/2022", TRUE, NA,
            2,  "antibiotic", "15/01/2022", NA,   TRUE,
            2,  "aspirin",    "14/01/2022", TRUE, TRUE)

Intended output:

id	med	course	course_start	course_end
1	antibiotic	1	13/01/2022	14/01/2022
1	antibiotic	2	21/01/2022	23/02/2022
1	aspirin	1	14/01/2022	14/01/2022
2	antibiotic	1	14/01/2022	15/01/2022
2	aspirin	1	14/01/2022	14/01/2022

I can't simply use group_by(id, med) %>% summarise(course_start = min(adm_date), course_end = max(adm_date) because a medication can be prescribed in more than one course (e.g. two instances for the same antibiotic: 13-14 Jan, 21-22 Jan; the code above would merge the two into 13-22 Jan).

My approach is:

I coded columns start and end to identify the start and end of each course (Done - see sample dataset).
Using these to create a course (instance) column. (Don't know how to do this).
df |> group_by(id, med, course) |> summarise(course_start = min(adm_date), course_end = max(adm_date) (This should work).

(Of course, happy with any other approach).

(PS: this is similar to this question, however I don't have another factor that separates instances, so I can't use Stefano's brilliant solution)

CodePudding user response：

df %>% 
  filter(!(is.na(start) & is.na(end))) %>% 
  group_by(med, id) %>% 
  mutate(course_start = if_else(start == T, adm_date, NA_character_),
         course_end = if_else(is.na(end), lead(adm_date), course_start)) %>% 
  drop_na(course_start) %>% 
  select(id, med, starts_with("course")) %>% 
  mutate(course = row_number(), .after = med) %>% 
  ungroup

# A tibble: 5 x 5
     id med        course course_start course_end
  <dbl> <chr>       <int> <chr>        <chr>     
1     1 antibiotic      1 13/01/2022   14/01/2022
2     1 antibiotic      2 21/01/2022   23/02/2022
3     1 aspirin         1 14/01/2022   14/01/2022
4     2 antibiotic      1 14/01/2022   15/01/2022
5     2 aspirin         1 14/01/2022   14/01/2022

CodePudding user response：

First group_by(id, med) so that they're treated as a group
Create a course_end column by ifelse. If end is NA, set the next row of adm_date as the course_end. If it's not NA, use the adm_date.
Rename adm_date to start_date
Remove rows that contain NA in the start column
Remove start and end columns
Create a course column by forcing med into numeric column and calculate the cumsum of it.

library(tidyverse)

df %>% 
  group_by(id, med) %>% 
  mutate(course_end = ifelse(is.na(end), lead(adm_date), adm_date)) %>% 
  rename("start_date" = "adm_date") %>% 
  drop_na(start) %>% 
  select(-start, -end) %>% 
  mutate(course = cumsum(as.numeric(as.factor(med))))

# A tibble: 5 × 5
# Groups:   id, med [4]
     id med        start_date course_end course
  <dbl> <chr>      <chr>      <chr>       <dbl>
1     1 antibiotic 13/01/2022 14/01/2022      1
2     1 antibiotic 21/01/2022 22/02/2022      2
3     1 aspirin    14/01/2022 14/01/2022      1
4     2 antibiotic 14/01/2022 15/01/2022      1
5     2 aspirin    14/01/2022 14/01/2022      1