While working with the randomForest package in R, I came across the following behaviour and was wondering why it happens.
If I fit a random forest to the Boston housing data like so:
library(MASS)
library(randomForest)
data("Boston")
set.seed(101)
rf <- randomForest(medv ~ ., data = Boston, importance = TRUE)
Then, to look at the variable importance (specifically the percent increase in MSE), I can do this:
> rf$importance
%IncMSE IncNodePurity
crim 9.1072362 2556.4934
zn 0.6773818 269.0987
indus 5.5466086 2372.3983
chas 0.8297999 260.9694
nox 11.2673993 2870.7901
rm 35.1327635 12738.1708
age 3.6626845 1067.3055
dis 7.4365111 2497.2382
rad 1.5139761 394.6151
tax 3.9388712 1478.1129
ptratio 7.3573333 2789.9989
black 1.6405931 787.7995
lstat 57.1932326 11921.4248
However, if I call the importance function from randomForest, I get this:
> randomForest::importance(rf)
%IncMSE IncNodePurity
crim 17.748309 2556.4934
zn 3.333258 269.0987
indus 11.227245 2372.3983
chas 5.267923 260.9694
nox 19.850569 2870.7901
rm 36.633648 12738.1708
age 15.084757 1067.3055
dis 19.368978 2497.2382
rad 6.333343 394.6151
tax 12.125730 1478.1129
ptratio 15.120461 2789.9989
black 8.863837 787.7995
lstat 30.674737 11921.4248
As you can see, the results for %IncMSE are different. For example, when using rf$importance, the variable lstat is clearly the most important variable. However, randomForest::importance(rf) ranks rm as the most important, followed closely by lstat.
Why am I getting different results for %IncMSE on the same model fit?
CodePudding user response:
See the scale parameter of importance() in the documentation, which defaults to TRUE. Setting it to FALSE returns the same result as rf$importance. The documentation describes the parameter as follows:
scale
For permutation based measures, should the measures be divided their “standard errors”?
library(MASS)
library(randomForest)
#> randomForest 4.7-1.1
#> Type rfNews() to see new features/changes/bug fixes.
data("Boston")
set.seed(101)
rf <- randomForest(medv ~ ., data = Boston, importance = TRUE)
rf$importance[1:5,]
#> %IncMSE IncNodePurity
#> crim 9.1072362 2556.4934
#> zn 0.6773818 269.0987
#> indus 5.5466086 2372.3983
#> chas 0.8297999 260.9694
#> nox 11.2673993 2870.7901
importance(rf, scale = FALSE)[1:5, ]
#> %IncMSE IncNodePurity
#> crim 9.1072362 2556.4934
#> zn 0.6773818 269.0987
#> indus 5.5466086 2372.3983
#> chas 0.8297999 260.9694
#> nox 11.2673993 2870.7901
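To make the effect of scale concrete, the scaled values can be rebuilt by hand. The sketch below assumes that, for a regression forest, the fitted object stores the "standard errors" as a per-variable vector in rf$importanceSD and that importance() divides the raw %IncMSE by that vector when scale = TRUE. The names raw, se, manual and scaled are illustrative only.

```r
library(MASS)
library(randomForest)

data("Boston")
set.seed(101)
rf <- randomForest(medv ~ ., data = Boston, importance = TRUE)

# Raw (unscaled) permutation importance stored on the fitted object
raw <- rf$importance[, "%IncMSE"]

# "Standard errors" of the permutation importance
# (assumed here to be one value per predictor for a regression forest)
se <- rf$importanceSD

# Dividing the raw values by their standard errors should reproduce
# what importance() returns with its default scale = TRUE
manual <- raw / se
scaled <- importance(rf, type = 1, scale = TRUE)[, "%IncMSE"]

# Compare the two vectors, ignoring names
all.equal(unname(manual), unname(scaled))
```

If the assumption holds, the comparison reports that the two vectors match. This also explains the change in ranking: a variable with a large raw importance but a large standard error (here lstat) can fall below one whose importance is smaller but more stable across trees (here rm).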