Can somebody help me, please? I want to prepare data for XGBoost prediction, so I need to encode the factor columns. I use sparse.model.matrix(), but there is a problem: I don't know why the function ignores some of the columns. I'll try to explain. I have a dataset with many variables, but right now these 3 are the important ones:
- Tsunami.Event.Validity - Factor with 6 classes: -1,0,1,2,3,4
- Tsunami.Cause.Code - Factor with 6 classes: 0,1,2,3,4,5
- Total.Death.Description - Factor with 5 classes: 0,1,2,3,4
But when I use sparse.model.matrix() I get a matrix with only 15 columns, not 6+6+5=17 as expected. Can somebody give me some advice?
sp_matrix = sparse.model.matrix(Deadly ~ Tsunami.Event.Validity + Tsunami.Cause.Code + Total.Death.Description - 1, data = datas)
str(sp_matrix)
Output:
Formal class 'dgCMatrix' [package "Matrix"] with 6 slots
..@ i : int [1:2510] 0 1 2 3 4 5 6 7 8 9 ...
..@ p : int [1:16] 0 749 757 779 823 892 1495 2191 2239 2241 ...
..@ Dim : int [1:2] 749 15
..@ Dimnames:List of 2
.. ..$ : chr [1:749] "1" "2" "3" "4" ...
.. ..$ : chr [1:15] "Tsunami.Event.Validity-1" "Tsunami.Event.Validity0" "Tsunami.Event.Validity1" "Tsunami.Event.Validity2" ...
..@ x : num [1:2510] 1 1 1 1 1 1 1 1 1 1 ...
..@ factors : list()
..$ assign : int [1:15] 0 1 1 1 1 1 2 2 2 2 ...
..$ contrasts:List of 3
.. ..$ Tsunami.Event.Validity : chr "contr.treatment"
.. ..$ Tsunami.Cause.Code : chr "contr.treatment"
.. ..$ Total.Death.Description: chr "contr.treatment"
CodePudding user response:
This question is a duplicate of "In R, for categorical data with N unique categories, why does sparse.model.matrix() not produce a one-hot encoding with N columns?" ... but that question was never answered.
The answers to this question explain how you could get the full model matrix you're looking for, but don't explain why you might not want to. (For what it's worth, unlike regular linear models, regression trees are robust to multicollinearity, so a full model matrix would actually work in this case. It's still worth understanding why R gives you the answer it does, and why this won't hurt your predictive accuracy ...)
This is a fundamental property of the way that linear models based (additively) on more than one categorical predictor work (and hence the way that R constructs model matrices). When you construct a model matrix based on factors f1, ..., fn with numbers of levels n1, ..., nn, the number of predictor variables is 1 + sum(ni-1), not sum(ni). Let's see how this works with a slightly simpler example:
xx <- expand.grid(A=factor(1:2), B = factor(1:2), C = factor(1:2))
model.matrix(~A + B + C - 1, xx)
A1 A2 B2 C2
1 1 0 0 0
2 0 1 0 0
3 1 0 1 0
4 0 1 1 0
5 1 0 0 1
6 0 1 0 1
7 1 0 1 1
8 0 1 1 1
We have a total of (1 + 3*(2-1) =) 4 parameters.
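The same counting rule can be checked against the sizes in the question. This is only a sketch: the data frame `toy` and the short names `V`, `C`, `D` are made-up stand-ins for the three factors (6, 6 and 5 levels), not the real tsunami data.

```r
# Hypothetical stand-in for the three factors in the question.
# expand.grid() guarantees that every level actually occurs in the data.
toy <- expand.grid(V = factor(-1:4), C = factor(0:5), D = factor(0:4))

n_levels <- sapply(toy, nlevels)      # 6, 6, 5
expected <- 1 + sum(n_levels - 1)     # counting rule: 1 + 5 + 5 + 4

# Same formula shape as in the question: additive, intercept dropped
mm <- model.matrix(~ V + C + D - 1, data = toy)

expected    # 15
ncol(mm)    # 15
```

With the intercept dropped, the first factor keeps all 6 of its levels and the other two lose one level each, which is where 6 + 5 + 4 = 15 comes from.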
The first parameter (A1) describes the expected mean at the baseline level of all factors (A=1, B=1, C=1). Because the intercept was dropped with -1, the second parameter (A2) is the expected mean at A=2, B=1, C=1, so the difference A2 - A1 is the effect of moving from A=1 to A=2 (independent of the other factors). Parameters 3 and 4 (B2, C2) describe the analogous differences between B=1 and B=2, and between C=1 and C=2.
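To make the meaning of the four columns concrete, here is a small sketch that fits lm() to a made-up, noise-free response. The numbers 10, 2, 3 and 5 are arbitrary assumptions chosen so the coefficients are easy to read off.

```r
xx <- expand.grid(A = factor(1:2), B = factor(1:2), C = factor(1:2))

# Made-up response: 10 at the baseline (A=1, B=1, C=1),
# plus 2, 3 and 5 for the second level of A, B and C respectively
xx$y <- 10 + 2 * (xx$A == "2") + 3 * (xx$B == "2") + 5 * (xx$C == "2")

# Same formula shape as above: additive, intercept dropped
fit <- lm(y ~ A + B + C - 1, data = xx)
round(coef(fit), 6)
# A1 = 10, A2 = 12, B2 = 3, C2 = 5
```

A1 and A2 come out as the means at B=1, C=1 for the two levels of A (their difference, 2, is the A effect), while B2 and C2 are the differences between level 2 and level 1 of B and C.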
You might be thinking "but I want predictor variables for all the levels of all the factors", e.g.:
library(Matrix)  # provides fac2sparse()
m <- do.call(cbind, lapply(xx, \(x) t(fac2sparse(x))))
dim(m)
## [1] 8 6
This has all six columns expected, not just 4. But if you examine this matrix, or call rankMatrix(m) or caret::findLinearCombos(m), you'll discover that it is multicollinear. In a typical (fixed-effect) additive linear model, you can only estimate an intercept plus the differences between levels, not values associated with every level. In a regression tree model, the multicollinearity will make your computations slightly less efficient, and will make results about variable importance confusing, but shouldn't hurt your predictions.
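A quick way to see the multicollinearity is to compare the number of columns with the rank. The sketch below rebuilds both matrices for the toy example; it swaps in base R's qr() on a dense copy instead of rankMatrix(), and the names `m_full` and `m_trt` are made up for illustration.

```r
library(Matrix)  # for fac2sparse()

xx <- expand.grid(A = factor(1:2), B = factor(1:2), C = factor(1:2))

# Full indicator (one-hot) matrix: one column per level of every factor
m_full <- do.call(cbind, lapply(xx, function(x) t(fac2sparse(x))))

# Matrix R builds by default with treatment contrasts, intercept dropped
m_trt <- model.matrix(~ A + B + C - 1, xx)

# Rank via QR decomposition of the dense matrices
rank_full <- qr(as.matrix(m_full))$rank
rank_trt  <- qr(m_trt)$rank

c(columns = ncol(m_full), rank = rank_full)  # 6 columns, rank 4
c(columns = ncol(m_trt),  rank = rank_trt)   # 4 columns, rank 4
```

The two extra columns in the full matrix carry no new information: within each factor the indicator columns sum to a column of ones, so two of the six columns can be rebuilt from the others.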