Please consider this minimal reproducible example of a random forest regression estimate:
library(randomForest)
# fix missing data
airquality <- na.roughfix(airquality)
set.seed(123)
#fit the random forest model
rf_fit <- randomForest(formula = Ozone ~ ., data = airquality)
#define new observation
new <- data.frame(Solar.R=250, Wind=8, Temp=70, Month=5, Day=5)
set.seed(123)
#use predict all on new observation
rf_predict<-predict(rf_fit, newdata=new, predict.all = TRUE)
rf_predict$aggregate
library(tidyverse)
predict_mean <- rf_predict$individual %>%
as_tibble() %>%
rowwise() %>%
transmute(avg = mean(V1:V500))
predict_mean
I was expecting rf_predict$aggregate and predict_mean to return the same value, but they differ.
Where and why am I wrong about this assumption?
My final objective is to get a confidence interval of the predicted value.
CodePudding user response:
I believe your code needs a c_across() call for the calculation to be performed correctly.
The ?c_across documentation tells us:
c_across() is designed to work with rowwise() to make it easy to perform row-wise aggregations.
predict_mean <- rf_predict$individual %>%
as_tibble() %>%
rowwise() %>%
transmute(avg = mean(c_across(V1:V500)))
>predict_mean
[1] 30.5
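As an alternative sketch, base R's rowMeans() averages across all columns of each row without rowwise() or c_across(). Here a small made-up matrix (the name individual and its values are hypothetical) stands in for rf_predict$individual, which has one row per new observation and one column per tree, so the snippet runs without fitting a forest:

```r
# Stand-in for rf_predict$individual: 1 observation x 5 trees (made-up values)
individual <- matrix(c(28, 33, 30, 35, 29), nrow = 1)

# Average the per-tree predictions for each row (each new observation)
row_avg <- rowMeans(individual)
row_avg
```

With the real object, rowMeans(rf_predict$individual) should play the same role as the c_across() pipeline above.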
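An answer to a previous question points out that mean() can't handle a data.frame.
In your code, though, the issue is slightly different: inside transmute(), V1:V500 is not a column selection but the : sequence operator applied to the values of V1 and V500.
mean() therefore averages a sequence running from the value of V1 to the value of V500 in steps of 1, and ignores V2 to V499.
c_across() applies tidy selection to V1:V500 and hands the values of all 500 columns in the current row to mean() as a single vector.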
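To see what goes wrong without c_across(), here is a minimal base R sketch. The data frame tree_preds and its four values are made up for illustration; they only mimic per-tree predictions for one observation:

```r
# Hypothetical per-tree predictions for one new observation (values are made up)
tree_preds <- data.frame(V1 = 10.2, V2 = 35.0, V3 = 31.5, V4 = 13.9)

# On plain values, V1:V4 is the sequence operator: 10.2, 11.2, 12.2, 13.2
seq_vals <- tree_preds$V1:tree_preds$V4

# Mean of that sequence: depends only on V1 and V4, ignores V2 and V3
wrong_mean <- mean(seq_vals)

# Mean of all four per-tree values, which is what was intended
right_mean <- mean(unlist(tree_preds))

c(wrong = wrong_mean, right = right_mean)
```

The two numbers differ because the first never looks at the columns between the endpoints, which is the same reason predict_mean did not match rf_predict$aggregate in the question.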