I'm working with a dataset and I need to find wherever there are matching PATIENT_ID values and then compare compare the dx_specific value between those rows and only keep the row with the dx_specific value highest in the hierarchy. In the end there should only be one row per PATIENT_ID and the dx_specific value should be the one with the lowest value in the hierarchy.
a <- data.frame(PATIENT_ID = c(1, 1, 2, 2, 3, 3, 3, 4),
CREATE_DATE = as.Date(c("2016-10-02", "2016-10-02", "2018-09-04", "2018-09-04",
"2019-07-29", "2019-07-29", "2019-07-29", "2020-06-01")),
dx_broad = c("Bipolar", "MDD", "Bipolar", "Bipolar", "Other", "Other", "MD phys", "MDD"),
dx_specific = c("Bipolar 1", "MDD recurrent", "Bipolar 1", "Bipolar 2", "Anxiety",
"Other", "MD phys", "MDD recurrent"))
dput(a)
structure(list(PATIENT_ID = c(1, 1, 2, 2, 3, 3, 3), CREATE_DATE = structure(c(17076,
17076, 17778, 17778, 18106, 18106, 18106), class = "Date"), dx_broad = c("Bipolar",
"MDD", "Bipolar", "Bipolar", "Other", "Other", "MD phys"), dx_specific = c("Bipolar 1",
"MDD recurrent", "Bipolar 1", "Bipolar 2", "Anxiety", "Other",
"MD phys")), class = "data.frame", row.names = c(NA, -7L))
Hierarchy of dx_specific values:
- Bipolar 1
- Bipolar 2
- MDD recurrent
- Anxiety
- Other
- MD phys
Eventually I reached this solution:
solution <- a %>%
group_by(PATIENT_ID) %>%
mutate(weight = case_when(dx_specific == "Bipolar 1" ~ 1,
dx_specific == "Bipolar 2" ~ 2,
dx_specific == "MDD recurrent" ~ 3,
dx_specific == "Anxiety" ~ 4,
dx_specific == "Other" ~ 5,
dx_specific == "MD phys" ~ 6))
slice(which.min(weight)) %>%
select(-weight)
However, I was told despite working this was not an 'elegant' solution to the problem. I am very new to R so I was interested to hear what other solutions people might have. Thank you for your time.
CodePudding user response:
I would probably code dx_specific
as a factor. slice_min
would then select the values based on the factor level after grouping.
library("dplyr")
dx_weights <- c("Bipolar 1", "Bipolar 2", "MDD recurrent", "Anxiety", "Other", "MD phys")
a |>
mutate(dx_specific=factor(dx_specific, levels=dx_weights)) |>
group_by(PATIENT_ID) |>
slice_min(dx_specific) |>
ungroup()
# A tibble: 4 × 4
PATIENT_ID CREATE_DATE dx_broad dx_specific
<dbl> <date> <chr> <fct>
1 1 2016-10-02 Bipolar Bipolar 1
2 2 2018-09-04 Bipolar Bipolar 1
3 3 2019-07-29 Other Anxiety
4 4 2020-06-01 MDD MDD recurrent
CodePudding user response:
Using data.table
:
dx <- c('Bipolar 1', 'Bipolar 2', 'MDD recurrent', 'Anxiety', 'Other', 'MD phys')
library(data.table)
setDT(a)[, dx_specific:=factor(dx_specific, levels=dx)]
a[, .SD[which.min(dx_specific)], by=.(PATIENT_ID)]
## PATIENT_ID CREATE_DATE dx_broad dx_specific
## 1: 1 2016-10-02 Bipolar Bipolar 1
## 2: 2 2018-09-04 Bipolar Bipolar 1
## 3: 3 2019-07-29 Other Anxiety
## 4: 4 2020-06-01 MDD MDD recurrent
This groups records by PATIENT_ID
and for each group extracts the subset of columns which row has the smallest value of dx_specific
( .SD[which.min(dx_specific)]
). .SD
represents a subset of columns, the default being all columns except those included in the by=...
clause (so, all columns).