I am working with medical claims data to create inpatient episodes. There isn't an 'episode identifier' column within the dataset. My intention is to create a unique identifier for each episode to tie the claims to, which I can handle after I can properly identify the correct admissions and discharge dates for each episode. To keep it simple, here is a table of fake data that reflects a situation that I'm struggling with:
Patient ID | Admitted Date | Discharge Date |
---|---|---|
810 | 2020-12-15 | 2020-12-16 |
810 | 2021-06-17 | 2021-06-19 |
810 | 2021-06-19 | 2021-06-27 |
810 | 2021-06-27 | 2021-07-03 |
With this example dataframe, the first row shows a simple inpatient episode. Lines two through four have admission dates and discharge dates that are tied together. This is due to patients switching hospital divisions, initiating a new REV code.
I had originally used an ifelse statement that inherently failed. I used that without thinking of situations like this, where there are more two lines needed to be grouped as an episode.
Does anyone have any recommendations for packages/resources to use in order to turn lines 2-4 into a single row saying
Patient ID | Admitted Date | Discharge Date |
---|---|---|
810 | 2020-12-15 | 2020-12-16 |
810 | 2021-06-17 | 2021-07-03 |
Thanks! Let me know if any further explanation is needed.
CodePudding user response:
library(tidyverse)
library(lubridate)
#>
#> Attaching package: 'lubridate'
#> The following objects are masked from 'package:base':
#>
#> date, intersect, setdiff, union
tribble(
~patient, ~admitted, ~discharge,
810, "2020-12-15", "2020-12-16",
810, "2021-06-17", "2021-06-19",
810, "2021-06-19", "2021-06-27",
810, "2021-06-27", "2021-07-03"
) |>
mutate(across(-patient, ymd),
group = if_else(discharge == lead(admitted) | admitted == lag(discharge), 1, 0)) |>
group_by(patient, group) |>
summarise(admitted = first(admitted), discharge = last(discharge)) |>
arrange(admitted)
#> `summarise()` has grouped output by 'patient'. You can override using the
#> `.groups` argument.
#> # A tibble: 2 × 4
#> # Groups: patient [1]
#> patient group admitted discharge
#> <dbl> <dbl> <date> <date>
#> 1 810 NA 2020-12-15 2020-12-16
#> 2 810 1 2021-06-17 2021-07-03
Created on 2022-05-10 by the reprex package (v2.0.1)
CodePudding user response:
Here is a way with package igraph
. It creates the directed graph of the dates columns, gets its connected components and uses these components to split the dates. Then keeps the first admission date and the last discharge date.
df1 <- read.table(text = "
'Patient ID' 'Admitted Date' 'Discharge Date'
810 2020-12-15 2020-12-16
810 2021-06-17 2021-06-19
810 2021-06-19 2021-06-27
810 2021-06-27 2021-07-03
", header = TRUE, check.names = FALSE)
df1[-1] <- lapply(df1[-1], as.Date)
suppressPackageStartupMessages(library(igraph))
g <- graph_from_data_frame(df1[-1])
cl <- components(g)
sp <- split(names(cl$membership), cl$membership)
new <- apply(df1[-1], 1, \(x) {
which(sapply(sp, \(y) all(x %in% y)))
})
result <- by(df1, new, \(x) {
data.frame('Patient ID' = x[1, 1, drop = TRUE],
'Admitted Date' = x[1, 2, drop = TRUE],
'Discharge date' = x[nrow(x), 3, drop = TRUE])
})
result <- do.call(rbind, result)
result
#> Patient.ID Admitted.Date Discharge.date
#> 1 810 2020-12-15 2020-12-16
#> 2 810 2021-06-17 2021-07-03
rm(sp, new) # final clean-up
Created on 2022-05-10 by the reprex package (v2.0.1)
CodePudding user response:
There's probably some built-in dplyr function that helps but I think I've found this "ugly" way to tackle the problem:
check_if_tied_to_next_row <- function(df, row_number) {
if(df[row_number, "Discharge Date"] == df[row_number 1, "Admitted Date"] & df[row_number, "Patient ID"] == df[row_number 1, "Patient ID"]) {
return(TRUE)
} else {
return(FALSE)
}
}
merge_consecutive_rows <- function(df, row_number) {
df[row_number, "Discharge Date"] <- df[row_number 1, "Discharge Date"]
df <- remove_row(df, row_number 1)
return(df)
}
remove_row <- function(df, row_number) {
return(df[-row_number,])
}
n <- nrow(df) - 1 #Nothing to merge into the last row!
for(i in 1:n) {
while(check_if_tied_to_next_row(df, i)) {
df <- merge_consecutive_rows(df, i)
}
}
CodePudding user response:
Here's another option, using dplyr
and lubridate
.
df <- read.table(
text = "
Patient ID Admitted Date Discharge Date
810 2020-12-15 2020-12-16
810 2021-06-17 2021-06-19
810 2021-06-19 2021-06-27
810 2021-06-27 2021-07-03
",
sep = "\t", header = TRUE)
library(dplyr)
library(lubridate)
df %>%
mutate(date_difference = ymd(Admitted.Date) - lag(ymd(Discharge.Date))) %>%
mutate(
is_start = ifelse(date_difference > 0 | is.na(date_difference), TRUE, FALSE),
is_end = ifelse(lead(date_difference) > 0 | is.na(lead(date_difference)), TRUE, FALSE)
) %>%
filter(is_start | is_end) %>%
mutate(
is_admitted = Admitted.Date,
is_discharged = case_when(
is_end ~ Discharge.Date,
!is_end ~ lead(Discharge.Date)
)
) %>%
na.omit() %>%
select(Patient.ID, Admitted.Date = is_admitted, Discharge.Date = is_discharged)
CodePudding user response:
This is exactly what I created the ivs package for. It allows you to work with interval vectors like what you have here. You can solve this using iv_groups()
, which computes the non-overlapping "groups" of stay dates per patient.
library(dplyr)
library(ivs)
df <- tribble(
~Patient.ID, ~Admitted.Date, ~Discharge.Date,
810L, "2020-12-15", "2020-12-16",
810L, "2021-06-17", "2021-06-19",
810L, "2021-06-19", "2021-06-27",
810L, "2021-06-27", "2021-07-03"
)
df <- mutate(df, Admitted.Date = as.Date(Admitted.Date))
df <- mutate(df, Discharge.Date = as.Date(Discharge.Date))
df
#> # A tibble: 4 × 3
#> Patient.ID Admitted.Date Discharge.Date
#> <int> <date> <date>
#> 1 810 2020-12-15 2020-12-16
#> 2 810 2021-06-17 2021-06-19
#> 3 810 2021-06-19 2021-06-27
#> 4 810 2021-06-27 2021-07-03
# Create an interval vector combining the hospital stay as:
# [Admitted.Date, Discharge.Date)
df <- df %>%
mutate(Stay = iv(Admitted.Date, Discharge.Date), .keep = "unused")
df
#> # A tibble: 4 × 2
#> Patient.ID Stay
#> <int> <iv<date>>
#> 1 810 [2020-12-15, 2020-12-16)
#> 2 810 [2021-06-17, 2021-06-19)
#> 3 810 [2021-06-19, 2021-06-27)
#> 4 810 [2021-06-27, 2021-07-03)
# Assuming you have multiple patients, we will group by `Patient.ID`.
# Then compute the non-overlapping interval "groups" per patient with `iv_groups()`
df %>%
group_by(Patient.ID) %>%
summarise(Stay = iv_groups(Stay), .groups = "drop")
#> # A tibble: 2 × 2
#> Patient.ID Stay
#> <int> <iv<date>>
#> 1 810 [2020-12-15, 2020-12-16)
#> 2 810 [2021-06-17, 2021-07-03)
# You can also see which "group" each stay fell in by using `iv_identify_group()`
df %>%
group_by(Patient.ID) %>%
mutate(Group = iv_identify_group(Stay))
#> # A tibble: 4 × 3
#> # Groups: Patient.ID [1]
#> Patient.ID Stay Group
#> <int> <iv<date>> <iv<date>>
#> 1 810 [2020-12-15, 2020-12-16) [2020-12-15, 2020-12-16)
#> 2 810 [2021-06-17, 2021-06-19) [2021-06-17, 2021-07-03)
#> 3 810 [2021-06-19, 2021-06-27) [2021-06-17, 2021-07-03)
#> 4 810 [2021-06-27, 2021-07-03) [2021-06-17, 2021-07-03)