Home > Software engineering >  average values in a column based on two factors
average values in a column based on two factors

Time:11-10

I have a large dataframe with 4 different variables ($name), that have many different values ($expl.val). Based on a model I made, I also have thirty separate predictions of ten different models, 3 runs for each ($pred.name). For each prediction I have a value (pred.value).

name    expl.val    pred.name   pred.val
var1    0           RUN1_MODEL1 0.57727
var1    0           RUN2_MODEL1 0.561696
var1    0           RUN3_MODEL1 0.553354
var1    0           RUN1_MODEL2 0.719538
var1    0           RUN2_MODEL2 0.695912
var1    0           RUN3_MODEL2 0.729262
var2    5           RUN1_MODEL1 0.694463
var2    5           RUN2_MODEL1 0.699222
var2    5           RUN3_MODEL1 0.695147
var2    5           RUN1_MODEL4 0.886816
var2    5           RUN2_MODEL4 0.960639
var2    5           RUN3_MODEL4 0.982607

My ultimate goal is to plot response curves (x-axis: $expl.value, y-axis: $pred.value), but as the data frame is now, I would for each variable (%name) have 30 response curves, as I have 30 separate predictions per variable value. This would be messy.

Therefore, I'd like to eliminate the separate prediction runs per model, so that for each variable value I only have 10 predicted values (ten per model instead of thirty).

So the output would, in this case, look like this:

name    expl.val    pred.name   pred.val
var1    0           RUNavg_MODEL1   0.564107
var1    0           RUNavg_MODEL2   0.714904
var2    5           RUNavg_MODEL1   0.696227
var2    5           RUNavg_MODEL4   0.943354

I am unsure how to approach this in Rstudio, because I want to average based on the different runs per different model, but then also separated by the value of the 4 variables on which the predictions depend. My only instinct is that I may want to split $pred.name so I get a new factor variable with 3 levels (RUN1,RUN2,RUN3) that I then somehow need to average not only by MODEL# but also by $expl.val.

CodePudding user response:

You could use

library(stringr)
library(dplyr)

df %>% 
  group_by(name, expl.val, model = str_replace(pred.name, "RUN\\d _", "RUNavg_")) %>% 
  summarise(pred.val = mean(pred.val), .groups = "drop")

This returns

# A tibble: 4 x 4
  name  expl.val model              pred.val
  <chr>    <dbl> <chr>                 <dbl>
1 var1         0 AllData_RUNavg_ANN    0.564
2 var1         0 AllData_RUNavg_GBM    0.715
3 var2         5 AllData_RUNavg_ANN    0.696
4 var2         5 AllData_RUNavg_GLM    0.943

Data

structure(list(name = c("var1", "var1", "var1", "var1", "var1", 
"var1", "var2", "var2", "var2", "var2", "var2", "var2"), expl.val = c(0, 
0, 0, 0, 0, 0, 5, 5, 5, 5, 5, 5), pred.name = c("AllData_RUN1_ANN", 
"AllData_RUN2_ANN", "AllData_RUN3_ANN", "AllData_RUN1_GBM", "AllData_RUN2_GBM", 
"AllData_RUN3_GBM", "AllData_RUN1_ANN", "AllData_RUN2_ANN", "AllData_RUN3_ANN", 
"AllData_RUN1_GLM", "AllData_RUN2_GLM", "AllData_RUN3_GLM"), 
    pred.val = c(0.57727, 0.561696, 0.553354, 0.719538, 0.695912, 
    0.729262, 0.694463, 0.699222, 0.695147, 0.886816, 0.960639, 
    0.982607)), class = c("spec_tbl_df", "tbl_df", "tbl", "data.frame"
), row.names = c(NA, -12L), spec = structure(list(cols = list(
    name = structure(list(), class = c("collector_character", 
    "collector")), expl.val = structure(list(), class = c("collector_double", 
    "collector")), pred.name = structure(list(), class = c("collector_character", 
    "collector")), pred.val = structure(list(), class = c("collector_double", 
    "collector"))), default = structure(list(), class = c("collector_guess", 
"collector")), skip = 1L), class = "col_spec"))

CodePudding user response:

Data:

dat <- structure(list(name = c("var1", "var1", "var1", "var1", "var1", 
"var1", "var2", "var2", "var2", "var2", "var2", "var2"), expl.val = c(0L, 
0L, 0L, 0L, 0L, 0L, 5L, 5L, 5L, 5L, 5L, 5L), pred.name = c("RUN1_MODEL1", 
"RUN2_MODEL1", "RUN3_MODEL1", "RUN1_MODEL2", "RUN2_MODEL2", "RUN3_MODEL2", 
"RUN1_MODEL1", "RUN2_MODEL1", "RUN3_MODEL1", "RUN1_MODEL4", "RUN2_MODEL4", 
"RUN3_MODEL4"), pred.val = c(0.57727, 0.561696, 0.553354, 0.719538, 
0.695912, 0.729262, 0.694463, 0.699222, 0.695147, 0.886816, 0.960639, 
0.982607)), row.names = c(NA, -12L), class = "data.frame")

A data.table solution:

library(data.table)
setDT(dat)

unique( dat[, .(expl.val,pred.val=mean(pred.val)),
   by = .(name, pred.name=sub(".*_","RUNavg_",pred.name)) ] )[,c(1,3,2,4)]

   name expl.val     pred.name  pred.val
1: var1        0 RUNavg_MODEL1 0.5641067
2: var1        0 RUNavg_MODEL2 0.7149040
3: var2        5 RUNavg_MODEL1 0.6962773
4: var2        5 RUNavg_MODEL4 0.9433540

Also, a base R approach:

dat <- as.data.frame(dat)

setNames( aggregate( dat$pred.val,
   by=list( dat$name, dat$expl.val, sub(".*_","RUNavg_", dat$pred.name) ),
   mean ), colnames(dat) )

  name expl.val     pred.name  pred.val
1 var1        0 RUNavg_MODEL1 0.5641067
2 var2        5 RUNavg_MODEL1 0.6962773
3 var1        0 RUNavg_MODEL2 0.7149040
4 var2        5 RUNavg_MODEL4 0.9433540
  • Related