I am working with a dataset of about 2000 rows and 22 columns. I need to assign a score of 0-4 into a new column, based on the contents of two others. Here's the example:
SAMP_SMPT_USER_REFERENCE | MEAS_ANAL_METH_CODE | oP.Res | TP.Res | OP_METH | TP_METH |
---|---|---|---|---|---|
0031 | 21 | 0.02 | 0.04 | ||
0032 | 24 | 0.12 | 0.003 |
(with apologies for how the database generated that heading)
I am trying to build a loop that looks at MEAS_ANAL_METH_CODE and oP.Res. First it will see if MEAS_ANAL_METH_CODE is 21 or 24, then based on that answer it will look at the value in oP.Res and add a category (0,1,2,3,4) to the OP_METH column. Then it will so it again for TP.Res, looking at MEAS_ANAL_METH_CODE, TP.Res and assigning a value to TP_Meth.
I'm new at loops!
This is what I have tried:
for(oP.Res in Merged){
if(Merged$MEAS_ANAL_METH_CODE == 24 && Merged$oP.Res < 0.01) {
"0"
} else if(Merged$MEAS_ANAL_METH_CODE == 24 && Merged$oP.Res >= 0.01 && Merged$oP.Res < 0.0375) {
"1"
} else if(Merged$MEAS_ANAL_METH_CODE == 24 && Merged$oP.Res >= 0.0375 && Merged$oP.Res < 0.0725) {
"2"
} else if(Merged$MEAS_ANAL_METH_CODE == 24 && Merged$oP.Res >= 0.0725 && Merged$oP.Res < 0.1) {
"3"
} else if(Merged$MEAS_ANAL_METH_CODE == 24 && Merged$oP.Res >= 0.1) {
"4"
} else if(Merged$MEAS_ANAL_METH_CODE == 21 && Merged$oP.Res < 0.01) {
"4"
} else if(Merged$MEAS_ANAL_METH_CODE == 21 && Merged$oP.Res >= 0.01 && Merged$oP.Res < 0.0375) {
"3"
} else if(Merged$MEAS_ANAL_METH_CODE == 21 && Merged$oP.Res >= 0.0375 && Merged$oP.Res < 0.0725) {
"2"
} else if(Merged$MEAS_ANAL_METH_CODE == 21 && Merged$oP.Res >= 0.0725 && Merged$oP.Res < 0.1) {
"1"
} else {"0"}
}
I can see there is probably an issue about R not knowing where to put the outputs, but I cannot figure it out.
CodePudding user response:
A solution using loops and a lookup list.
First store the cut breaks and labels for each code in a list.
tmp=list(
"21"=list(
"brk"=c(0,0.01,0.0375,0.0725,0.1,1),
"lab"=0:4
),
"24"=list(
"brk"=c(0,0.01,0.0375,0.0725,0.1,1),
"lab"=4:0
)
)
Then loop over the columns of interest and for each code apply the cut function.
for(cc in c("oP.Res","TP.Res")) {
Merged[paste0(cc,"_cut")]=NA
for (ctg in unique(Merged$MEAS_ANAL_METH_CODE)) {
Merged[Merged$MEAS_ANAL_METH_CODE==ctg,paste0(cc,"_cut")]=
as.character(
cut(
Merged[Merged$MEAS_ANAL_METH_CODE==ctg,cc],
tmp[[as.character(ctg)]][["brk"]],
tmp[[as.character(ctg)]][["lab"]]
)
)
}
}
CodePudding user response:
another option, taking advantage of the readability of {dplyr} pipes:
given the example dataframe merged ...
merged <- structure(list(SAMP_SMPT_USER_REFERENCE = 31:32, MEAS_ANAL_METH_CODE = c(21L,
24L), oP.Res = c(0.02, 0.12), TP.Res = c(0.04, 0.003), OP_METH = c(NA,
NA), TP_METH = c(NA, NA)), class = "data.frame", row.names = 1:2)
## > merged
## SAMP_SMPT_USER_REFERENCE MEAS_ANAL_METH_CODE oP.Res TP.Res OP_METH TP_METH
## 1 31 21 0.02 0.040 NA NA
## 2 32 24 0.12 0.003 NA NA
... you can apply the same classification rule across multiple columns (and obtain specifically named columns) like so:
library(dplyr)
merged %>%
mutate(across(c(oP.Res, TP.Res),
.fns = function(result) {
cat_borders <- c(10, 375, 725, 1000, Inf) / 1000
cat = findInterval(result, cat_borders)
## category borders for method 21 are the reverse
## of those for method 24, we can mirror by subtracting from 5:
ifelse(MEAS_ANAL_METH_CODE == 24, cat, 5 - cat)
},
.names = "OP_METH_CAT_{.col}"
)
)
... result:
## SAMP_SMPT_USER_REFERENCE MEAS_ANAL_METH_CODE oP.Res TP.Res OP_METH TP_METH
## 1 31 21 0.02 0.040 NA NA
## 2 32 24 0.12 0.003 NA NA
## OP_METH_CAT_oP.Res OP_METH_CAT_TP.Res
## 1 4 4
## 2 1 0