I could use some help with a tidyverse solution to this question.
I'm working with a large dataset that has 20 binary cancer outcomes (cancer_{cancertype}), as well as corresponding ages ({cancertype}_age). Some individuals are missing cancer phenotype information - I would like to set the age variables for each cancer type to NA if the cancer phenotype is missing. I've been trying to implement mutate(across()), but am having some issues specifying the appropriate arguments.
# load tidyverse lib
library(tidyverse)
# Set seed for reproducibility
set.seed(42)
# generate dataframe
cancer_ds <- data.frame(
  id = 1000:1009,
  cancer_a = rep(0:1, length = 10),
  cancer_b = c(rep(0, 3), NA, NA, 1, NA, rep(1, 3)),
  cancer_c = c(rep(0:1, each = 2, len = 6), rep(NA, 4)),
  a_age = sample(30:60, 10, FALSE),
  b_age = sample(30:60, 10, FALSE),
  c_age = sample(30:60, 10, FALSE)
)
cancer_ds
cancer_list <- paste("cancer", letters[1:3], sep = "_")
cancer_list
# attempted code
out_ds <- cancer_ds %>%
mutate(across(ends_with("age"), ~replace(is.na(cancer_list)))
# expected output dataset
out_ds_exp <- cancer_ds %>%
  mutate(b_age = ifelse(b_age %in% c("43", "49", "47"), NA, b_age),
         c_age = ifelse(c_age %in% c("49", "31", "37", "32"), NA, c_age))
out_ds_exp
Any help is appreciated! Thanks.
Answer:
Here is an option.
cancer_ds %>%
  rename_with(~ str_replace_all(.x, "([a-z])_([a-z]{2,})", "\\2_\\1")) %>%
  pivot_longer(-id, names_to = c(".value", "grp"), names_sep = "_") %>%
  mutate(age = if_else(is.na(cancer), NA_integer_, age)) %>%
  pivot_wider(names_from = grp, values_from = c(cancer, age))
## A tibble: 10 x 7
#      id cancer_a cancer_b cancer_c age_a age_b age_c
#   <int>    <dbl>    <dbl>    <dbl> <int> <int> <int>
# 1  1000        0        0        0    46    33    54
# 2  1001        1        0        0    34    54    56
# 3  1002        0        0        1    30    34    33
# 4  1003        1       NA        1    54    NA    34
# 5  1004        0       NA        0    39    NA    42
# 6  1005        1        1        0    33    55    57
# 7  1006        0       NA       NA    47    NA    NA
# 8  1007        1        1       NA    60    44    NA
# 9  1008        0        1       NA    44    32    NA
#10  1009        1        1       NA    36    38    NA
Explanation: We first fix the inconsistent column names using rename_with: you have both "<what>_<group>" (e.g. "cancer_a") and "<group>_<what>" (e.g. "a_age"). After that, it's a simple matter of reshaping multiple paired columns from wide to long. We can then replace age values with NA wherever cancer is NA, before reshaping back from long to wide.
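Since the question asked about mutate(across()), here is a minimal sketch of that route as well, which skips the reshaping and keeps the original column names. It assumes every {x}_age column has a matching cancer_{x} column. The toy data frame and the helpers flag_for() and mask_ages() are hypothetical names made up for this illustration; they are not part of the code above.

```r
library(dplyr)

# Hypothetical helper: map an age column name to its phenotype column,
# e.g. "a_age" -> "cancer_a"
flag_for <- function(age_col) paste0("cancer_", sub("_age$", "", age_col))

# Hypothetical helper: set each *_age value to NA where the matching
# cancer_* value is missing. The flag column is looked up in `df` itself,
# so this sketch is intended for ungrouped data frames.
mask_ages <- function(df) {
  df %>%
    mutate(across(
      ends_with("_age"),
      ~ replace(.x, is.na(df[[flag_for(cur_column())]]), NA)
    ))
}

# Toy data with the same naming scheme as the question (values made up)
toy <- data.frame(
  id       = 1:4,
  cancer_a = c(0, 1, NA, 1),
  cancer_b = c(NA, 0, 1, NA),
  a_age    = c(40L, 41L, 42L, 43L),
  b_age    = c(50L, 51L, 52L, 53L)
)

res <- mask_ages(toy)
res
```

On the question's data this would be called as mask_ages(cancer_ds). cur_column() gives the name of the column across() is currently working on, which is what lets each age column find its own phenotype column.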