within a dataframe find rows with the same id and compare values in another column to decide which r-CodePudding

I'm working with a dataset and I need to find wherever there are matching PATIENT_ID values and then compare compare the dx_specific value between those rows and only keep the row with the dx_specific value highest in the hierarchy. In the end there should only be one row per PATIENT_ID and the dx_specific value should be the one with the lowest value in the hierarchy.

   a <- data.frame(PATIENT_ID =  c(1, 1, 2, 2, 3, 3, 3, 4),
                  CREATE_DATE = as.Date(c("2016-10-02", "2016-10-02", "2018-09-04", "2018-09-04", 
                    "2019-07-29", "2019-07-29", "2019-07-29", "2020-06-01")),
                  dx_broad = c("Bipolar", "MDD", "Bipolar", "Bipolar", "Other", "Other", "MD phys", "MDD"),
                  dx_specific = c("Bipolar 1", "MDD recurrent", "Bipolar 1", "Bipolar 2", "Anxiety", 
                    "Other", "MD phys", "MDD recurrent"))

dput(a)
structure(list(PATIENT_ID = c(1, 1, 2, 2, 3, 3, 3), CREATE_DATE = structure(c(17076, 
17076, 17778, 17778, 18106, 18106, 18106), class = "Date"), dx_broad = c("Bipolar", 
"MDD", "Bipolar", "Bipolar", "Other", "Other", "MD phys"), dx_specific = c("Bipolar 1", 
"MDD recurrent", "Bipolar 1", "Bipolar 2", "Anxiety", "Other", 
"MD phys")), class = "data.frame", row.names = c(NA, -7L))

Hierarchy of dx_specific values:

Bipolar 1
Bipolar 2
MDD recurrent
Anxiety
Other
MD phys

Eventually I reached this solution:

solution <- a %>%
  group_by(PATIENT_ID) %>%
  mutate(weight = case_when(dx_specific == "Bipolar 1" ~ 1,
                            dx_specific == "Bipolar 2" ~ 2,
                            dx_specific == "MDD recurrent" ~ 3, 
                            dx_specific == "Anxiety" ~ 4,
                            dx_specific == "Other" ~ 5,
                            dx_specific == "MD phys" ~ 6))
  slice(which.min(weight)) %>%
  select(-weight)

However, I was told despite working this was not an 'elegant' solution to the problem. I am very new to R so I was interested to hear what other solutions people might have. Thank you for your time.

CodePudding user response：

I would probably code dx_specific as a factor. slice_min would then select the values based on the factor level after grouping.

library("dplyr")

dx_weights <- c("Bipolar 1", "Bipolar 2", "MDD recurrent", "Anxiety", "Other", "MD phys")

a |>
  mutate(dx_specific=factor(dx_specific, levels=dx_weights)) |>
  group_by(PATIENT_ID) |>
  slice_min(dx_specific) |>
  ungroup()

# A tibble: 4 × 4
  PATIENT_ID CREATE_DATE dx_broad dx_specific  
       <dbl> <date>      <chr>    <fct>        
1          1 2016-10-02  Bipolar  Bipolar 1    
2          2 2018-09-04  Bipolar  Bipolar 1    
3          3 2019-07-29  Other    Anxiety      
4          4 2020-06-01  MDD      MDD recurrent

CodePudding user response：

Using data.table:

dx <- c('Bipolar 1', 'Bipolar 2', 'MDD recurrent', 'Anxiety', 'Other', 'MD phys')
library(data.table)
setDT(a)[, dx_specific:=factor(dx_specific, levels=dx)]
a[, .SD[which.min(dx_specific)], by=.(PATIENT_ID)]

##    PATIENT_ID CREATE_DATE dx_broad   dx_specific
## 1:          1  2016-10-02  Bipolar     Bipolar 1
## 2:          2  2018-09-04  Bipolar     Bipolar 1
## 3:          3  2019-07-29    Other       Anxiety
## 4:          4  2020-06-01      MDD MDD recurrent

This groups records by PATIENT_ID and for each group extracts the subset of columns which row has the smallest value of dx_specific ( .SD[which.min(dx_specific)] ). .SD represents a subset of columns, the default being all columns except those included in the by=... clause (so, all columns).