Mapping non-numeric factor to choose higher value between two columns in R

Time:11-25

I have a data frame with two columns: PathGroupStage and ClinGroupStage. I want to create a new column, OutputStage, that holds the higher of the two stages.

Valid stage values: I, IA, IB, II, IIA, IIB, III, IIIA, IIIB, IIIC, IV, IVA, IVB, IVC, Unknown.

  • If both stages have values, then use the highest, e.g., IIIB > IIIA > III
  • If one is missing and the other has a value, then use the one with a value
  • If both are missing or unknown, then the output is unknown

How would I derive the OutputStage variable by comparing the non-numeric values in the two columns? I think I need ordered factor levels, but how would I compare factors across different columns?

Here is the sample dataset:

   PathGroupStage       ClinGroupStage
1              II                 <NA>
2               I                   IA
3             IVB                  IVB
4            IIIA Unknown/Not Reported
5               I                  III
6              II                 <NA>
7            IIIA                  IIB
8              II                   II
9            <NA>                 <NA>
10           IIIB Unknown/Not Reported

 df <- structure(list(PathGroupStage = c("II", "I", "IVB", "IIIA", "I", 
    "II", "IIIA", "II", NA, "IIIB"), ClinGroupStage = c(NA, "IA", 
    "IVB", "Unknown/Not Reported", "III", NA, "IIB", "II", NA, "Unknown/Not Reported"
    )), row.names = c(NA, 10L), class = "data.frame")

CodePudding user response:

One option could be:

library(dplyr)

# Stage labels from lowest to highest
stages <- c("Unknown/Not Reported", "I", "IA", "IB", "II", "IIA", "IIB", "III", "IIIA", "IIIB", "IIIC", "IV", "IVA", "IVB", "IVC")

df %>%
    mutate(across(everything(), ~ factor(., levels = stages, ordered = TRUE)),
           OutputStage = pmax(PathGroupStage, ClinGroupStage, na.rm = TRUE))

   PathGroupStage       ClinGroupStage OutputStage
1              II                 <NA>          II
2               I                   IA          IA
3             IVB                  IVB         IVB
4            IIIA Unknown/Not Reported        IIIA
5               I                  III         III
6              II                 <NA>          II
7            IIIA                  IIB        IIIA
8              II                   II          II
9            <NA>                 <NA>        <NA>
10           IIIB Unknown/Not Reported        IIIB
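
Note that row 9, where both inputs are missing, comes out as <NA> rather than as unknown. A minimal sketch of one way to cover that third rule, assuming the unknown label should be "Unknown/Not Reported" as in the sample data, is to fill in the leftover NA after the pmax() step. It is written in base R so that it runs without dplyr; the intermediate names path, clin and out are only illustrative.

```r
# Stage labels from lowest to highest; the unknown label ranks below every real stage
stages <- c("Unknown/Not Reported", "I", "IA", "IB", "II", "IIA", "IIB",
            "III", "IIIA", "IIIB", "IIIC", "IV", "IVA", "IVB", "IVC")

df <- data.frame(
  PathGroupStage = c("II", "I", "IVB", "IIIA", "I", "II", "IIIA", "II", NA, "IIIB"),
  ClinGroupStage = c(NA, "IA", "IVB", "Unknown/Not Reported", "III", NA, "IIB",
                     "II", NA, "Unknown/Not Reported")
)

# Ordered factors with identical levels can be compared element-wise
path <- factor(df$PathGroupStage, levels = stages, ordered = TRUE)
clin <- factor(df$ClinGroupStage, levels = stages, ordered = TRUE)

# Row-wise maximum; na.rm = TRUE keeps the non-missing value when only one is NA
out <- pmax(path, clin, na.rm = TRUE)

# Assumption: rows where both inputs were NA should be reported as unknown
out[is.na(out)] <- "Unknown/Not Reported"

df$OutputStage <- out
```

With this extra step, row 9 becomes "Unknown/Not Reported" and the other rows are unchanged.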

CodePudding user response:

library(dplyr)

df <- structure(
    list(
        PathGroupStage = c("II", "I", "IVB", "IIIA", "I", "II", "IIIA", "II", NA, "IIIB"),
        ClinGroupStage = c(NA, "IA", "IVB", "Unknown/Not Reported", "III", NA, "IIB", "II", NA, "Unknown/Not Reported")
    ),
    row.names = c(NA, 10L), class = "data.frame"
) 

# The columns are still character vectors, not factors, as far as R is
# concerned, as you can see from the tibble print method
df %>% as_tibble()

# "Unknown/Not Reported" goes first so that any known stage outranks it
stages <- c("Unknown/Not Reported", "I", "IA", "IB", "II", "IIA", "IIB", "III", "IIIA", "IIIB", "IIIC", "IV", "IVA", "IVB", "IVC")

df %>%
    as_tibble() %>%
    dplyr::mutate(
        # once they are ordered factors with the same levels, they can be compared with >=, < and so on
        PathGroupStage = factor(PathGroupStage, levels = stages, ordered = TRUE),
        ClinGroupStage = factor(ClinGroupStage, levels = stages, ordered = TRUE),
        # case when is like a more general if_else() with multiple conditions
        # of the form: logical test ~ result if true
        OutputStage = case_when(
            (is.na(ClinGroupStage) | ClinGroupStage == "Unknown/Not Reported") & 
            (is.na(PathGroupStage) | PathGroupStage == "Unknown/Not Reported") ~ 
                factor("Unknown/Not Reported", levels = stages, ordered = TRUE),
            is.na(PathGroupStage) ~ ClinGroupStage,
            is.na(ClinGroupStage) ~ PathGroupStage,
            PathGroupStage >= ClinGroupStage ~ PathGroupStage,
            ClinGroupStage >= PathGroupStage ~ ClinGroupStage
        )
    )
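
For comparison, the same three rules can be sketched in base R without factors, by mapping each stage to its position in a vector of known stages and taking the larger position. This is only an illustrative sketch: higher_stage and unknown_label are hypothetical names, and it assumes that both NA and "Unknown/Not Reported" count as "no usable value".

```r
# Known stages only, from lowest to highest
stages <- c("I", "IA", "IB", "II", "IIA", "IIB", "III", "IIIA", "IIIB",
            "IIIC", "IV", "IVA", "IVB", "IVC")
unknown_label <- "Unknown/Not Reported"

# Hypothetical helper: returns the higher of two stage vectors, element-wise
higher_stage <- function(a, b) {
  # match() gives NA for missing values and for anything not in `stages`,
  # so NA and the unknown label are both treated as unusable
  rank_a <- match(a, stages)
  rank_b <- match(b, stages)
  # Larger rank wins; the result is NA only when both ranks are NA
  best <- pmax(rank_a, rank_b, na.rm = TRUE)
  out <- stages[best]
  out[is.na(out)] <- unknown_label
  out
}

result <- higher_stage(
  c("II", "I", NA, "IIIA", "IIIA"),
  c(NA, "IA", NA, "Unknown/Not Reported", "IIB")
)
```

The result is a character vector; it can be wrapped in factor() afterwards if an ordered factor is needed.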