I have the following data frame:
dat <- structure(list(model_name = c("Random Forest", "XGBoost", "XGBoost-reg",
"Null model", "Plain LM", "Elastic LM", "LM-pep.charge", "LM-rf.10vip"
), RMSE = c(0.853, 0.886, 0.719, 2.41, 16.6, 0.731, 1.16, 1.03
), MAE = c(0.545, 0.708, 0.589, 1.98, 8.6, 0.588, 0.874, 0.729
), `R^2` = c(0.806, 0.865, 0.915, NA, 0.0645, 0.927, 0.8, 0.822
), ccc = c(0.89, 0.928, 0.951, 0, 0.0685, 0.945, 0.847, 0.901
)), row.names = c(NA, -8L), class = c("tbl_df", "tbl", "data.frame"
))
It looks like this:
  model_name      RMSE   MAE   `R^2`    ccc
  <chr>          <dbl> <dbl>   <dbl>  <dbl>
1 Random Forest  0.853 0.545  0.806  0.89
2 XGBoost        0.886 0.708  0.865  0.928
3 XGBoost-reg    0.719 0.589  0.915  0.951
4 Null model     2.41  1.98  NA      0
5 Plain LM      16.6   8.6    0.0645 0.0685
6 Elastic LM     0.731 0.588  0.927  0.945
7 LM-pep.charge  1.16  0.874  0.8    0.847
8 LM-rf.10vip    1.03  0.729  0.822  0.901
It stores the evaluation metrics for 8 prediction models. My goal is to select the top-performing model that consistently excels in the majority of evaluations.
By manually evaluating the metrics, I determined the top performing model this way:
Metrics -> Top 1
-----------------
RMSE -> XGBoost-reg
MAE -> Random Forest
R^2 -> Elastic LM
CCC -> XGBoost-reg
# Therefore, the winner is XGBoost-reg
It's worth noting that RMSE and MAE are error measures, with lower values indicating better performance, while R^2 and CCC are correlation measures, with higher values indicating better performance.
How can I do this with R?
CodePudding user response:
One option is to reshape the data into 'long' format and flip the sign of the higher-is-better metrics (R^2 and ccc) by multiplying them by -1, so that the lowest 'value1' is always the best. Then group by 'name', keep the row with the lowest 'value1' in each group, count how often each model wins, and select the first row:
library(dplyr)
library(tidyr)
dat %>%
  pivot_longer(cols = -model_name, values_drop_na = TRUE) %>%
  mutate(value1 = case_when(name %in% c("R^2", "ccc") ~ value * -1,
                            TRUE ~ value)) %>%
  group_by(name) %>%
  slice_min(value1, n = 1) %>%
  ungroup() %>%
  count(model_name, sort = TRUE) %>%
  slice_head(n = 1)
-output
# A tibble: 1 × 2
model_name n
<chr> <int>
1 XGBoost-reg 2
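To check the result against the manual table in the question, it can help to look at the per-metric winners before counting them. The following is a minimal sketch of that intermediate step. The names `higher_is_better`, `score` and `winners` are illustrative and not part of the code above, and the data frame is rebuilt as a plain data.frame only so the snippet runs on its own.

```r
library(dplyr)
library(tidyr)

# Same values as `dat` in the question, rebuilt so the snippet is self-contained.
dat <- data.frame(
  model_name = c("Random Forest", "XGBoost", "XGBoost-reg", "Null model",
                 "Plain LM", "Elastic LM", "LM-pep.charge", "LM-rf.10vip"),
  RMSE  = c(0.853, 0.886, 0.719, 2.41, 16.6, 0.731, 1.16, 1.03),
  MAE   = c(0.545, 0.708, 0.589, 1.98, 8.6, 0.588, 0.874, 0.729),
  `R^2` = c(0.806, 0.865, 0.915, NA, 0.0645, 0.927, 0.8, 0.822),
  ccc   = c(0.89, 0.928, 0.951, 0, 0.0685, 0.945, 0.847, 0.901),
  check.names = FALSE  # keep the column name "R^2" as is
)

# Assumption: every metric not listed here is an error measure (lower is better).
higher_is_better <- c("R^2", "ccc")

winners <- dat %>%
  pivot_longer(cols = -model_name, values_drop_na = TRUE) %>%
  # Negate higher-is-better metrics so the smallest score is always the best.
  mutate(score = if_else(name %in% higher_is_better, -value, value)) %>%
  group_by(name) %>%
  slice_min(score, n = 1) %>%   # keeps ties, if any
  ungroup() %>%
  select(metric = name, model_name, value)

winners
```

Each row of `winners` is one metric with its best model, so piping it into `count(model_name, sort = TRUE)` gives the same tally as above.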
Alternatively, use summarise() across the numeric columns to pick the model_name at the max index (for R^2 and ccc) or the min index (for RMSE and MAE), then convert to 'long' format and count:
dat %>%
  summarise(across(where(is.numeric),
                   ~ if (cur_column() %in% c("R^2", "ccc"))
                       model_name[which.max(.x)] else model_name[which.min(.x)])) %>%
  pivot_longer(cols = everything(), names_to = NULL) %>%
  count(value, sort = TRUE) %>%
  slice_head(n = 1)
-output
# A tibble: 1 × 2
value n
<chr> <int>
1 XGBoost-reg 2
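Since `slice_head(n = 1)` returns a single row even when two models win the same number of metrics, it may be worth keeping the whole tally and checking whether the top count is unique. This is a hedged sketch of that check, reusing the summarise approach. The names `tally` and `top` are illustrative, and the data frame is rebuilt so the snippet runs on its own.

```r
library(dplyr)
library(tidyr)

# Same values as `dat` in the question, rebuilt so the snippet is self-contained.
dat <- data.frame(
  model_name = c("Random Forest", "XGBoost", "XGBoost-reg", "Null model",
                 "Plain LM", "Elastic LM", "LM-pep.charge", "LM-rf.10vip"),
  RMSE  = c(0.853, 0.886, 0.719, 2.41, 16.6, 0.731, 1.16, 1.03),
  MAE   = c(0.545, 0.708, 0.589, 1.98, 8.6, 0.588, 0.874, 0.729),
  `R^2` = c(0.806, 0.865, 0.915, NA, 0.0645, 0.927, 0.8, 0.822),
  ccc   = c(0.89, 0.928, 0.951, 0, 0.0685, 0.945, 0.847, 0.901),
  check.names = FALSE
)

# One winner per metric, then the number of wins per model.
tally <- dat %>%
  summarise(across(where(is.numeric),
                   ~ if (cur_column() %in% c("R^2", "ccc"))
                       model_name[which.max(.x)] else model_name[which.min(.x)])) %>%
  pivot_longer(cols = everything(),
               names_to = "metric", values_to = "model_name") %>%
  count(model_name, sort = TRUE)

# All models that share the highest number of wins (one row if there is no tie).
top <- filter(tally, n == max(n))
top
```

Note that `which.max()` and `which.min()` return only the first position on ties within a metric, whereas `slice_min()` in the first approach keeps all tied rows.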
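Or with base R, replacing NA with -Inf, negating the two error columns, and taking the most frequent per-metric winner: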
names(which.max(table(dat$model_name[max.col(t(replace(dat[-1],
is.na(dat[-1]), -Inf) * list(-1, -1, 1, 1)), 'first')])))
[1] "XGBoost-reg"
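The base R one-liner is dense, so here is one way it could be unpacked into named steps. This is a sketch of the same idea rather than a drop-in replacement: it uses `sweep()` on a matrix instead of multiplying the data frame by a list, and the names `direction`, `scores`, `best_row`, `winners` and `tally` are illustrative.

```r
# Same values as `dat` in the question, rebuilt so the snippet is self-contained.
dat <- data.frame(
  model_name = c("Random Forest", "XGBoost", "XGBoost-reg", "Null model",
                 "Plain LM", "Elastic LM", "LM-pep.charge", "LM-rf.10vip"),
  RMSE  = c(0.853, 0.886, 0.719, 2.41, 16.6, 0.731, 1.16, 1.03),
  MAE   = c(0.545, 0.708, 0.589, 1.98, 8.6, 0.588, 0.874, 0.729),
  `R^2` = c(0.806, 0.865, 0.915, NA, 0.0645, 0.927, 0.8, 0.822),
  ccc   = c(0.89, 0.928, 0.951, 0, 0.0685, 0.945, 0.847, 0.901),
  check.names = FALSE
)

metrics <- as.matrix(dat[-1])  # 8 models (rows) x 4 metrics (columns)

# -1 for error measures (lower is better), +1 for correlation measures.
direction <- c(RMSE = -1, MAE = -1, `R^2` = 1, ccc = 1)

# Multiply each column by its direction so that "largest" always means "best".
scores <- sweep(metrics, 2, direction[colnames(metrics)], `*`)
scores[is.na(scores)] <- -Inf  # a missing metric can never win

# After transposing, each row is a metric; max.col() gives the best model's index.
best_row <- max.col(t(scores), ties.method = "first")
winners  <- setNames(dat$model_name[best_row], colnames(metrics))

tally <- table(winners)        # number of metrics won by each model
names(which.max(tally))
```

Keeping `winners` as a named vector makes it easy to see which model won which metric before the final tally.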