randomForest in R and factor variables

Time:06-25

I have a dataset with some continuous variables, some ordinal variables and some categorical qualitative variables.

I would like to use a random forest classifier (I have a categorical outcome), but I am not sure how to treat the ordinal and categorical features, which are both coded as factors at the moment. I would like the ordinal variables to be treated as numeric and each level of the qualitative ones to get its own dummy variable. How does R's randomForest normally handle factor features? Should I convert the qualitative variables into dummies and the ordinal ones into integer or numeric?

CodePudding user response:

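In functions that build a design matrix (lm, glm, model.matrix), factors are encoded by introducing dummy variables: k levels are encoded in k-1 dummy variables, i.e. "one-hot" coding with the reference level dropped. How these represent the levels depends on your choice of the "contrasts" setting. As far as I know, randomForest itself does not expand factors this way: it splits on an unordered factor directly by grouping its levels, and it treats an ordered factor like a numeric variable. You can inspect the dummy coding with contrasts, e.g.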

> contrasts(iris$Species)
           versicolor virginica
setosa              0         0
versicolor          1         0
virginica           0         1
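
To see the dummy columns themselves rather than the coding table, one can build a design matrix with model.matrix. This is only a small sketch on the built-in iris data; contr.sum is picked here purely to show that the columns change with the contrasts setting.

```r
# Default treatment contrasts: an intercept plus k - 1 = 2 dummy columns.
mm <- model.matrix(~ Species, data = iris)
colnames(mm)

# The same factor under sum-to-zero contrasts, for comparison.
mm_sum <- model.matrix(~ Species, data = iris,
                       contrasts.arg = list(Species = "contr.sum"))
unique(mm_sum)  # one distinct row per species
```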

Encoding an ordinal variable as an unordered factor thus adds degrees of freedom and discards the ordering, which may or may not be what you want. If you want to keep the information about the ordering of the levels, I would just encode the ordinal variable as an integer.
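
A minimal sketch of both conversions in base R, assuming a made-up data frame (the column names size and colour are hypothetical, not from the question). Note that as.integer on a factor follows the order of levels(), so the levels have to be listed in their ordinal order first.

```r
# Hypothetical toy data: 'size' is ordinal, 'colour' is qualitative.
df <- data.frame(
  size   = factor(c("small", "large", "medium", "small"),
                  levels = c("small", "medium", "large")),
  colour = factor(c("red", "blue", "green", "blue"))
)

# Ordinal factor -> integer codes that follow the order of levels():
# small = 1, medium = 2, large = 3.
df$size_int <- as.integer(df$size)

# Qualitative factor -> one 0/1 column per level. The "- 1" drops the
# intercept, so all k levels get a column instead of k - 1.
colour_dummies <- model.matrix(~ colour - 1, data = df)

# Predictor table with the integer-coded ordinal and the dummy columns.
x <- cbind(df["size_int"], colour_dummies)
```

The resulting x could then be passed as the predictors to randomForest together with the factor outcome. Whether the dummy expansion actually helps compared with leaving the qualitative variable as a factor is something to check on your own data.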
