R random forest aggregate vs individual prediction

Please consider this minimal reproducible example of a random forest regression estimate:

library(randomForest)

# fix missing data
airquality <- na.roughfix(airquality)

set.seed(123)
#fit the random forest model
rf_fit <- randomForest(formula = Ozone ~ .,  data = airquality)

#define new observation
new <- data.frame(Solar.R=250, Wind=8, Temp=70, Month=5, Day=5)

set.seed(123)
#use predict all on new observation
rf_predict<-predict(rf_fit, newdata=new, predict.all = TRUE)

rf_predict$aggregate

library(tidyverse)

predict_mean <- rf_predict$individual %>% 
  as_tibble() %>% 
  rowwise() %>% 
  transmute(avg = mean(V1:V500))

predict_mean

I was expecting rf_predict$aggregate and predict_mean to give the same value.

Where and why am I wrong about this assumption?

My final objective is to get a confidence interval of the predicted value.

User response:

I believe your code needs a c_across() call for the mean to be computed correctly.

The ?c_across documentation says:

c_across() is designed to work with rowwise() to make it easy to perform row-wise aggregations.

predict_mean <- rf_predict$individual %>% 
  as_tibble() %>% 
  rowwise() %>% 
  transmute(avg = mean(c_across(V1:V500)))

>predict_mean
[1] 30.5
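
As a cross-check that does not depend on dplyr, the per-tree matrix can be averaged directly in base R. This is a minimal sketch that repeats the setup from the question; airquality_filled and tree_mean are names introduced here for illustration, and the comparison assumes the default regression forest with no bias correction.

```r
library(randomForest)

# Same setup as in the question (airquality_filled is an illustrative name)
airquality_filled <- na.roughfix(airquality)
set.seed(123)
rf_fit <- randomForest(Ozone ~ ., data = airquality_filled)
new <- data.frame(Solar.R = 250, Wind = 8, Temp = 70, Month = 5, Day = 5)
rf_predict <- predict(rf_fit, newdata = new, predict.all = TRUE)

# individual is a matrix: one row per new observation, one column per tree
tree_mean <- rowMeans(rf_predict$individual)

# Compare with the forest's own aggregate, allowing for floating-point error
all.equal(unname(tree_mean), unname(rf_predict$aggregate))

# Spread of the per-tree predictions for the new observation
# (a rough indication of variability, not a formal confidence interval)
quantile(rf_predict$individual[1, ], probs = c(0.025, 0.975))
```

The quantile() call only summarises how much the individual trees disagree. Whether that is an acceptable stand-in for the interval you are after depends on your use case.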

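An answer to a previous question points out that mean() can't handle a data.frame. That is not quite what goes wrong here, though. Inside transmute(), V1:V500 is not a column selection: it is evaluated as the base R `:` operator on the values of V1 and V500 in that row. So mean() receives a numeric sequence that starts at the first tree's prediction and counts in steps of 1 toward the last tree's prediction, and the other 498 columns are never used. c_across() applies column-selection semantics to V1:V500 and passes the selected values of the row to mean() as a single vector.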
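
The difference is easier to see on a tiny example. The sketch below uses a hypothetical one-row tibble, toy, with three made-up "tree" predictions; none of these values come from the random forest above.

```r
library(dplyr)
library(tibble)

# Hypothetical one-row example with three made-up "tree" predictions
toy <- tibble(V1 = 1.5, V2 = 100, V3 = 4.2)

# Inside transmute(), V1:V3 is base R's `:` applied to the values 1.5 and 4.2,
# i.e. the sequence 1.5, 2.5, 3.5; V2 is never used
seq_mean <- toy %>%
  rowwise() %>%
  transmute(avg = mean(V1:V3))

# c_across() treats V1:V3 as a column selection and returns c(1.5, 100, 4.2)
row_mean <- toy %>%
  rowwise() %>%
  transmute(avg = mean(c_across(V1:V3)))

seq_mean$avg  # 2.5, the mean of the sequence 1.5, 2.5, 3.5
row_mean$avg  # the mean of 1.5, 100 and 4.2
```

Because V2 is deliberately far from the other two values, the two results differ widely, which shows that the first version ignores the middle columns.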