I have a large dataframe with 4 different variables ($name), that have many different values ($expl.val). Based on a model I made, I also have thirty separate predictions of ten different models, 3 runs for each ($pred.name). For each prediction I have a value (pred.value).
name expl.val pred.name pred.val
var1 0 RUN1_MODEL1 0.57727
var1 0 RUN2_MODEL1 0.561696
var1 0 RUN3_MODEL1 0.553354
var1 0 RUN1_MODEL2 0.719538
var1 0 RUN2_MODEL2 0.695912
var1 0 RUN3_MODEL2 0.729262
var2 5 RUN1_MODEL1 0.694463
var2 5 RUN2_MODEL1 0.699222
var2 5 RUN3_MODEL1 0.695147
var2 5 RUN1_MODEL4 0.886816
var2 5 RUN2_MODEL4 0.960639
var2 5 RUN3_MODEL4 0.982607
My ultimate goal is to plot response curves (x-axis: $expl.value, y-axis: $pred.value), but as the data frame is now, I would for each variable (%name) have 30 response curves, as I have 30 separate predictions per variable value. This would be messy.
Therefore, I'd like to eliminate the separate prediction runs per model, so that for each variable value I only have 10 predicted values (ten per model instead of thirty).
So the output would, in this case, look like this:
name expl.val pred.name pred.val
var1 0 RUNavg_MODEL1 0.564107
var1 0 RUNavg_MODEL2 0.714904
var2 5 RUNavg_MODEL1 0.696227
var2 5 RUNavg_MODEL4 0.943354
I am unsure how to approach this in Rstudio, because I want to average based on the different runs per different model, but then also separated by the value of the 4 variables on which the predictions depend. My only instinct is that I may want to split $pred.name so I get a new factor variable with 3 levels (RUN1,RUN2,RUN3) that I then somehow need to average not only by MODEL# but also by $expl.val.
CodePudding user response:
You could use
library(stringr)
library(dplyr)
df %>%
group_by(name, expl.val, model = str_replace(pred.name, "RUN\\d _", "RUNavg_")) %>%
summarise(pred.val = mean(pred.val), .groups = "drop")
This returns
# A tibble: 4 x 4
name expl.val model pred.val
<chr> <dbl> <chr> <dbl>
1 var1 0 AllData_RUNavg_ANN 0.564
2 var1 0 AllData_RUNavg_GBM 0.715
3 var2 5 AllData_RUNavg_ANN 0.696
4 var2 5 AllData_RUNavg_GLM 0.943
Data
structure(list(name = c("var1", "var1", "var1", "var1", "var1",
"var1", "var2", "var2", "var2", "var2", "var2", "var2"), expl.val = c(0,
0, 0, 0, 0, 0, 5, 5, 5, 5, 5, 5), pred.name = c("AllData_RUN1_ANN",
"AllData_RUN2_ANN", "AllData_RUN3_ANN", "AllData_RUN1_GBM", "AllData_RUN2_GBM",
"AllData_RUN3_GBM", "AllData_RUN1_ANN", "AllData_RUN2_ANN", "AllData_RUN3_ANN",
"AllData_RUN1_GLM", "AllData_RUN2_GLM", "AllData_RUN3_GLM"),
pred.val = c(0.57727, 0.561696, 0.553354, 0.719538, 0.695912,
0.729262, 0.694463, 0.699222, 0.695147, 0.886816, 0.960639,
0.982607)), class = c("spec_tbl_df", "tbl_df", "tbl", "data.frame"
), row.names = c(NA, -12L), spec = structure(list(cols = list(
name = structure(list(), class = c("collector_character",
"collector")), expl.val = structure(list(), class = c("collector_double",
"collector")), pred.name = structure(list(), class = c("collector_character",
"collector")), pred.val = structure(list(), class = c("collector_double",
"collector"))), default = structure(list(), class = c("collector_guess",
"collector")), skip = 1L), class = "col_spec"))
CodePudding user response:
Data:
dat <- structure(list(name = c("var1", "var1", "var1", "var1", "var1",
"var1", "var2", "var2", "var2", "var2", "var2", "var2"), expl.val = c(0L,
0L, 0L, 0L, 0L, 0L, 5L, 5L, 5L, 5L, 5L, 5L), pred.name = c("RUN1_MODEL1",
"RUN2_MODEL1", "RUN3_MODEL1", "RUN1_MODEL2", "RUN2_MODEL2", "RUN3_MODEL2",
"RUN1_MODEL1", "RUN2_MODEL1", "RUN3_MODEL1", "RUN1_MODEL4", "RUN2_MODEL4",
"RUN3_MODEL4"), pred.val = c(0.57727, 0.561696, 0.553354, 0.719538,
0.695912, 0.729262, 0.694463, 0.699222, 0.695147, 0.886816, 0.960639,
0.982607)), row.names = c(NA, -12L), class = "data.frame")
A data.table solution:
library(data.table)
setDT(dat)
unique( dat[, .(expl.val,pred.val=mean(pred.val)),
by = .(name, pred.name=sub(".*_","RUNavg_",pred.name)) ] )[,c(1,3,2,4)]
name expl.val pred.name pred.val
1: var1 0 RUNavg_MODEL1 0.5641067
2: var1 0 RUNavg_MODEL2 0.7149040
3: var2 5 RUNavg_MODEL1 0.6962773
4: var2 5 RUNavg_MODEL4 0.9433540
Also, a base R approach:
dat <- as.data.frame(dat)
setNames( aggregate( dat$pred.val,
by=list( dat$name, dat$expl.val, sub(".*_","RUNavg_", dat$pred.name) ),
mean ), colnames(dat) )
name expl.val pred.name pred.val
1 var1 0 RUNavg_MODEL1 0.5641067
2 var2 5 RUNavg_MODEL1 0.6962773
3 var1 0 RUNavg_MODEL2 0.7149040
4 var2 5 RUNavg_MODEL4 0.9433540