randomForest importance measure percent MSE has different results depending on how it is called?

Time:09-10

While working with the randomForest package in R, I came across the following behaviour and am wondering why it happens.

If I create a random forest using the Boston housing data like so:

library(MASS)
library(randomForest)

data("Boston")

set.seed(101) 
rf <- randomForest(medv ~ ., data = Boston, importance = TRUE)

Then, if I want to look at the variable importance (specifically the percent increase in MSE), I can do this:

> rf$importance
           %IncMSE IncNodePurity
crim     9.1072362     2556.4934
zn       0.6773818      269.0987
indus    5.5466086     2372.3983
chas     0.8297999      260.9694
nox     11.2673993     2870.7901
rm      35.1327635    12738.1708
age      3.6626845     1067.3055
dis      7.4365111     2497.2382
rad      1.5139761      394.6151
tax      3.9388712     1478.1129
ptratio  7.3573333     2789.9989
black    1.6405931      787.7995
lstat   57.1932326    11921.4248

However, if I call the importance function from randomForest, I get this:

> randomForest::importance(rf)
          %IncMSE IncNodePurity
crim    17.748309     2556.4934
zn       3.333258      269.0987
indus   11.227245     2372.3983
chas     5.267923      260.9694
nox     19.850569     2870.7901
rm      36.633648    12738.1708
age     15.084757     1067.3055
dis     19.368978     2497.2382
rad      6.333343      394.6151
tax     12.125730     1478.1129
ptratio 15.120461     2789.9989
black    8.863837      787.7995
lstat   30.674737    11921.4248

As you can see, the %IncMSE results differ between the two calls. For example, with rf$importance the variable lstat is clearly the most important. With randomForest::importance(rf), however, rm is ranked most important, followed closely by lstat.

Why am I getting different results for the %IncMSE on the same model fit?

CodePudding user response:

See the scale parameter in the documentation of importance(), which defaults to TRUE. Setting it to FALSE returns the same values as rf$importance.

scale: For permutation based measures, should the measures be divided their “standard errors”?

library(MASS)
library(randomForest)
#> randomForest 4.7-1.1
#> Type rfNews() to see new features/changes/bug fixes.

data("Boston")

set.seed(101) 
rf <- randomForest(medv ~ ., data = Boston, importance = TRUE)

rf$importance[1:5,]
#>          %IncMSE IncNodePurity
#> crim   9.1072362     2556.4934
#> zn     0.6773818      269.0987
#> indus  5.5466086     2372.3983
#> chas   0.8297999      260.9694
#> nox   11.2673993     2870.7901


# Call the exported generic; it dispatches to the randomForest method
importance(rf, scale = FALSE)[1:5,]
#>          %IncMSE IncNodePurity
#> crim   9.1072362     2556.4934
#> zn     0.6773818      269.0987
#> indus  5.5466086     2372.3983
#> chas   0.8297999      260.9694
#> nox   11.2673993     2870.7901
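
To see where the scaled numbers come from, here is a rough check. It assumes that scale = TRUE simply divides each raw %IncMSE value by the matching entry of rf$importanceSD, the per-variable "standard errors" stored on the fitted object. That is my reading of the scale description quoted above rather than something the output above demonstrates, so treat it as a sketch. The names raw, scaled and manual are just illustrative.

```r
library(MASS)
library(randomForest)

data("Boston")

set.seed(101)
rf <- randomForest(medv ~ ., data = Boston, importance = TRUE)

# Unscaled permutation importance, as stored on the fitted object
raw <- rf$importance[, "%IncMSE"]

# What importance() returns with its default scale = TRUE
scaled <- importance(rf)[, "%IncMSE"]

# Assumption: scaling means dividing by the per-variable "standard errors"
# kept in rf$importanceSD (a vector with one entry per predictor for regression)
manual <- raw / rf$importanceSD

# Should be TRUE if the assumption above holds
all.equal(unname(manual), unname(scaled))
```

If the check holds, the change in ranking between lstat and rm is explained by lstat having a larger "standard error" relative to its raw importance than rm does.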