R - data.matrix() that keeps factors as factors?


My apologies -- I am not sure where to ask this.

I am trying to run cross-validation on a data set. I am trying to run it with my predictors as integers and with them as factors (I have been advised to do both).

I am using cva.glmnet() from the glmnetUtils package for the cross-validation. However, converting my data with data.matrix() turns all of the factors into integers (according to the documentation at link, "Logical and factor columns are converted to integers.").
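For example, with a small made-up data frame (just to illustrate the conversion, not my actual data):

d <- data.frame(f = factor(c("a", "b", "a")), x = c(1.5, 2.5, 3.5))
data.matrix(d)  # the factor column f comes back as its integer codes 1, 2, 1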

How can I run cross-validation and keep my factors as factors? Using the typical method of

cva.glmnet([outcome] ~ [predictors], family = "binomial")

does not work: the data set has 1,550 rows and 74,417 columns, and that call had not finished after more than 8 hours.

If I can use my GPU to speed things up, that would be great, but I haven't found a way to do so.

Any help is appreciated.

Sorry for no reproducible data, but I don't know how to make a data set that large easily.

Thank you.

CodePudding user response:

Your data set is too large for the cva.glmnet() function. My recommendation is to reduce your predictors to the most informative ones. To do this you could try feature-selection methods such as recursive feature elimination. The caret package may be helpful here, as it has convenient functions for reducing the number of features.
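For example, a minimal sketch of recursive feature elimination with caret might look like the following (preds and y are hypothetical stand-ins for your predictor data frame and binary outcome, and the candidate sizes are only illustrative):

library(caret)

# drop predictors that are constant or nearly constant first
nzv <- nearZeroVar(preds)
if (length(nzv) > 0) preds <- preds[, -nzv]

# recursive feature elimination with 5-fold CV and a random-forest ranker
ctrl    <- rfeControl(functions = rfFuncs, method = "cv", number = 5)
rfe_fit <- rfe(x = preds, y = y, sizes = c(50, 100, 500), rfeControl = ctrl)
predictors(rfe_fit)  # names of the selected predictors

Be aware that RFE itself can be expensive on data this wide, so you may want to try it on a subsample of the rows first.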

CodePudding user response:

I have one suggestion that might help. I don't know whether your data set has 74,417 columns before or after the categorical variables are expanded into dummy variables; if the former, it could become much larger still after expansion (if there are n1 numeric variables and a set of categorical variables f, and n(f,i) is the number of levels/unique values of the i-th categorical variable, then the final model matrix will have 1 + n1 + sum(n(f,i) - 1) columns).
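As a toy illustration with made-up numbers: 10 numeric predictors plus three factors with 4, 6 and 8 levels would expand to 1 + 10 + (3 + 5 + 7) = 26 columns:

n1  <- 10           # number of numeric predictors (made up)
lev <- c(4, 6, 8)   # levels of each categorical predictor (made up)
1 + n1 + sum(lev - 1)  # 26 columns in the expanded model matrix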

What you might do is convert your data frame to a sparse model matrix before calling cva.glmnet(); in other words, something like

# build a sparse, one-hot encoded predictor matrix from the data frame
X <- glmnet::makeX(train = subset(data, select = -[outcome]), sparse = TRUE)
# pass the sparse matrix and the outcome vector directly to cva.glmnet()
cva.glmnet(X, [outcome], family = "binomial")

makeX() will convert the factor variables to one-hot encoded dummy variables. (This is not quite the same as what cva.glmnet()'s formula interface would do, which presumably uses the default treatment contrasts to build sets of dummy variables that exclude the baseline level, but I think the effect on the results will be minimal.) If you prefer the contrast-based encoding, you could alternatively use Matrix::sparse.model.matrix(), as sketched below.
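A rough sketch of that alternative (dat and y are hypothetical placeholders for your data frame and its outcome column):

library(Matrix)
# sparse.model.matrix() applies the usual treatment contrasts, dropping each
# factor's baseline level; drop the intercept column before passing to glmnet
X   <- sparse.model.matrix(y ~ ., data = dat)[, -1]
fit <- glmnetUtils::cva.glmnet(X, dat$y, family = "binomial")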
