I have a dataset with some continuous variables, some ordinal variables and some categorical qualitative variables.
I would like to use a random forest classifier (I have a categorical outcome), but I am not sure how to treat the ordinal and categorical features, which are both coded as factor at the moment. I would like the ordinal variables to be considered as numeric and the qualitative ones to have each level as a separate dummy.
How does R's randomForest normally handle factor features? Should I transform the qualitative variables into dummies and the ordinal ones into integer or numeric?
CodePudding user response:
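When R builds a design matrix (for example in lm or glm), factors are encoded by introducing dummy variables. k levels are encoded in k-1 dummy variables, so this is not quite "one-hot" coding, and how the dummies represent the levels depends on your choice of the "contrasts" setting. As far as I know, randomForest itself does not expand factors this way: it splits unordered factors on subsets of their levels and treats ordered factors as numeric. You can inspect the dummy coding with contrasts, e.g.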
> contrasts(iris$Species)
           versicolor virginica
setosa              0         0
versicolor          1         0
virginica           0         1
Encoding an ordinal variable as a factor thus adds degrees of freedom, which may or may not be what you want. If you want to keep the information about the ordering of the levels, I would just encode the ordinal variable as an integer.
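As a minimal sketch of the conversion asked about in the question, the following uses base R only. The data frame, its column names (size, colour, weight) and the levels are made up for illustration, and the integer coding assumes the factor levels are listed in their natural order.

```r
# Toy data; column names and levels are hypothetical.
df <- data.frame(
  size   = factor(c("small", "large", "medium", "small"),
                  levels = c("small", "medium", "large")),  # ordinal
  colour = factor(c("red", "blue", "green", "red")),        # qualitative
  weight = c(1.2, 3.4, 2.2, 0.9)                            # continuous
)

# Ordinal factor -> integer codes (1 = small, 2 = medium, 3 = large).
# This only preserves the ordering if the levels are declared in order.
df$size_int <- as.integer(df$size)

# Qualitative factor -> one 0/1 column per level. The "- 1" drops the
# intercept, so all k levels get a column instead of k - 1.
dummies <- model.matrix(~ colour - 1, data = df)

# Predictor table: continuous, integer-coded ordinal, and dummy columns.
X <- cbind(df[, c("weight", "size_int")], dummies)
```

X, together with a factor outcome y, could then be passed to randomForest(x = X, y = y). Whether the dummy expansion helps compared with leaving the qualitative variable as a factor is something I would check on your own data.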