I have a dataset with over 200 columns and 2 million obs. In this data set the are 11 columns (all of them start with "DIAG") with alphanumeric categories and I want to create a new column that if the category I'm interested in is between that particular range, the column would receive 1, or else, 0.
I tried something like this, but it doesn't work:
data$GROUP_12 <- ifelse(across(contains('DIAG') >= "A15" & <= "A19",
ifelse(across(contains('DIAG') >= "A50" & <= "A64", 1, 0))))
The ranges I'm interest in are: A15-A19, A50-A64, A80-B09, B15-B19, B25-B34. So I would have to do this for each range, but all of them being added to GROUP_12 column.
Edit with reproductible data:
data <- data.frame(DIAG_PRINC = c("A25", "B32", "O90"),
DIAG_SECUN = c("A16", "Y55", "K97"),
DIAGSEC1 = c("B15", "J55", "Y97"),
DIAGSEC2 = c("L16", "B55", "A97"),
DIAGSEC3 = c("W10", "S57", "T44"),
DIAGSEC4 = c("A64", "A16", "Y55"),
DIAGSEC5 = c("A80", "D33", "V12"),
DIAGSEC6 = c("M16", "N55", "X97"),
DIAGSEC7 = c("B16", "Y57", "O58"),
DIAGSEC8 = c("V45", "C23", "Q97"),
DIAGSEC9 = c("F98", "R44", "A54"))
Any suggestions?
CodePudding user response:
Based on the example provided and the logic mentioned, we may use if_any
(if any of the columns in a particular row have the value) or if_all
(if all the columns in the selected columns in a particular row have the value) to return TRUE/FALSE
. Then wrap with
or as.integer
to coerce the logical to binary
library(dplyr)
vals_range <- c(sprintf("Ad", c(15:19, 50:64, 80)),
sprintf("Bd", c(1:9, 15:19, 25:34)))
data <- data %>%
mutate(GROUP_12 = (if_any(starts_with('DIAG'), ~ .x %in% vals_range)))
-output
data
DIAG_PRINC DIAG_SECUN DIAGSEC1 DIAGSEC2 DIAGSEC3 DIAGSEC4 DIAGSEC5 DIAGSEC6 DIAGSEC7 DIAGSEC8 DIAGSEC9 GROUP_12
1 A25 A16 B15 L16 W10 A64 A80 M16 B16 V45 F98 1
2 B32 Y55 J55 B55 S57 A16 D33 N55 Y57 C23 R44 1
3 O90 K97 Y97 A97 T44 Y55 V12 X97 O58 Q97 A54 1