Set specific values of a group of variables to NA based on another group of variables

Time:07-22

I could use some help with a tidyverse solution to this question.

I'm working with a large dataset that has 20 binary cancer outcomes (cancer_{cancertype}), as well as corresponding ages ({cancertype}_age). Some individuals are missing cancer phenotype information - I would like to set the age variables for each cancer type to NA if the cancer phenotype is missing. I've been trying to implement mutate(across()), but am having some issues specifying the appropriate arguments.

# load tidyverse lib
library(tidyverse)

# Set seed for reproducibility
set.seed(42)

# generate dataframe
cancer_ds <- data.frame(id = 1000:1009,
           cancer_a = rep(0:1, length = 10), 
           cancer_b = c(rep(0, 3), NA, NA, 1, NA, rep(1, 3)), 
           cancer_c = c(rep(0:1, each = 2, len = 6), rep(NA, 4)), 
           a_age = sample(30:60, 10, FALSE), 
           b_age = sample(30:60, 10, FALSE), 
           c_age = sample(30:60, 10, FALSE)
           ) 

cancer_ds

cancer_list <- paste("cancer", letters[1:3], sep = "_")

cancer_list

# attempted code
out_ds <- cancer_ds %>% 
          mutate(across(ends_with("age"), ~replace(is.na(cancer_list)))

# expected output dataset 
out_ds_exp <- cancer_ds %>% 
          mutate(b_age = ifelse(b_age %in% c("43", "49", "47"), NA, b_age), 
                 c_age = ifelse(c_age %in% c("49", "31", "37", "32"), NA, c_age))

out_ds_exp

Any help is appreciated! Thanks.

CodePudding user response:

Here is an option.

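cancer_ds %>%
    # make names consistent: "a_age" -> "age_a" ("cancer_a" is left unchanged)
    rename_with(~ str_replace_all(.x, "([a-z])_([a-z]{2,})", "\\2_\\1")) %>%
    # wide to long: one row per id and group, with columns "cancer" and "age"
    pivot_longer(-id, names_to = c(".value", "grp"), names_sep = "_") %>%
    # blank out age wherever the cancer phenotype is missing
    mutate(age = if_else(is.na(cancer), NA_integer_, age)) %>%
    # long back to wide
    pivot_wider(names_from = grp, values_from = c(cancer, age))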
## A tibble: 10 x 7
#      id cancer_a cancer_b cancer_c age_a age_b age_c
#   <int>    <dbl>    <dbl>    <dbl> <int> <int> <int>
# 1  1000        0        0        0    46    33    54
# 2  1001        1        0        0    34    54    56
# 3  1002        0        0        1    30    34    33
# 4  1003        1       NA        1    54    NA    34
# 5  1004        0       NA        0    39    NA    42
# 6  1005        1        1        0    33    55    57
# 7  1006        0       NA       NA    47    NA    NA
# 8  1007        1        1       NA    60    44    NA
# 9  1008        0        1       NA    44    32    NA
#10  1009        1        1       NA    36    38    NA

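Explanation: The column names are inconsistent: you have both "<what>_<group>" (e.g. "cancer_a") and "<group>_<what>" (e.g. "a_age"). We first use rename_with to bring the age columns into the "<what>_<group>" form, so "a_age" becomes "age_a". After that, pivot_longer with ".value" reshapes the paired columns from wide to long, giving one cancer column and one age column per id and group. We then set age to NA wherever cancer is NA, and reshape back from long to wide. Note that the age columns in the result are named "age_a", "age_b", "age_c" rather than "a_age", "b_age", "c_age".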
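
Since the question asked about mutate(across()), here is a minimal sketch of that route as well. It assumes every "<group>_age" column has a partner named "cancer_<group>". The toy data frame and its values are made up for illustration and are not the question's data. Inside the lambda, cur_column() gives the name of the age column currently being processed, and get() looks up the partner column by that name. Treat the get() lookup as an assumption about how across() evaluates formula lambdas in your dplyr version, and check the result on your own data.

```r
library(dplyr)

# Hypothetical toy data following the question's naming pattern
toy <- data.frame(
  id       = 1:4,
  cancer_a = c(0, 1, NA, 1),
  cancer_b = c(NA, 0, 1, NA),
  a_age    = c(40L, 51L, 37L, 60L),
  b_age    = c(33L, 45L, 58L, 49L)
)

out <- toy %>%
  mutate(across(
    ends_with("_age"),
    # "a_age" -> "cancer_a"; set age to NA where that partner column is NA
    ~ replace(.x, is.na(get(paste0("cancer_", sub("_age$", "", cur_column())))), NA)
  ))

out
```

On this toy data, a_age becomes NA in row 3 and b_age becomes NA in rows 1 and 4. Unlike the pivot approach, the original column names and order are left untouched.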
