Is the regsubsets function in R appropriate for really.big cases?

I am dealing with a p >> n case: over 300 variables but only 200 observations. I want to reduce the number of variables, since it is a nightmare to fit any sort of model with 300 of them. I read up on ways to do this, and the best I found was the regsubsets function in the leaps package, but I have to specify really.big = TRUE since, well, it really is big. However, I left my computer running it for over an hour and it still hasn't finished. Is there another function that might fare better?

CodePudding user response:

With 300 variables, there are 2^300 ≈ 2e90 (i.e., about 2 × 10^90) possible submodels. That is impossible to evaluate exhaustively on any computer you can imagine. I would suggest LASSO regression via the glmnet package instead. You have to build the model matrix yourself; i.e., if your response variable is in column 1 of your data frame, you need

library(glmnet)

## response in the first column, predictors in the rest
y <- dd[, 1]
X <- as.matrix(dd[, -1])

assuming all the predictors are continuous; otherwise you need X <- model.matrix(~ ., data = dd[,-1]). There are a few extra steps to LASSO regression (picking the regularization parameter); see vignette("glmnet", package = "glmnet") ...
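
For concreteness, a minimal sketch of those extra steps, assuming the X and y built above (using lambda.min is just one common convention; lambda.1se is a more conservative alternative):

## cross-validate to pick the regularization parameter lambda
cvfit <- cv.glmnet(X, y, alpha = 1)   # alpha = 1 -> LASSO penalty

## coefficients at the CV-optimal lambda; the predictors with
## nonzero coefficients are the ones the LASSO "selects"
sel <- coef(cvfit, s = "lambda.min")
nz <- as.vector(sel != 0)
setdiff(rownames(sel)[nz], "(Intercept)")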

Alternatively, you can use the glmulti package, which can use a genetic algorithm to search the space of possible models: it won't try everything, but it can keep trying an arbitrarily large number of candidates until the goodness of fit seems to have leveled off. A sketch follows.
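
Untested sketch of that approach; the settings here (AICc, confsetsize = 10) are just illustrative choices, and with 300 candidate predictors even the genetic search may take a while:

library(glmulti)

## genetic-algorithm search (method = "g") over main-effects-only
## models (level = 1), fit with lm and ranked by AICc
res <- glmulti(y ~ ., data = dd, level = 1, method = "g",
               crit = "aicc", fitfunction = "lm",
               confsetsize = 10)   # keep the 10 best models found

res@formulas[[1]]   # formula of the best model in the confidence set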

Tags: r