With a lot of legwork and help from previous Stack Overflow answers, I have successfully fit a separate LASSO regression to each of my 47,000 individual datasets (each 500 rows by 31 columns: 30 IVs and 1 DV) for a research project and stored the fits in a list called LASSO_fits. From there, I have also separated out and stored just the coefficients returned by these 47k LASSOs in a list called LASSO_Coeffs. My question is: how can I extract only the names of the Independent Variables/factors/columns that have been 'selected' for each dataset i (where i ranges from 1 to 47k) by each of these LASSO regressions, and assign them to a new list? To clarify, by 'selected' I mean those factors whose coefficients are greater than 0.
My plan was to first make sure the following code runs fine for a single case, then generalize it with either lapply or a for loop:
if (LASSO_Coeffs[[1]][["X1"]] > 0) {
  print(names(LASSO_Coeffs[[1]][["X1"]]))
}
However, my plan got derailed when the above code returned the following:
> if (LASSO_Coeffs[[1]][["X1"]] > 0) {
print(names(LASSO_Coeffs[[1]][["X1"]]))
}
NULL
P.S. The code used to produce LASSO_Coeffs, and the LASSO_fits object it came from, is included below in case it is relevant (the entire script, called "LASSO code.R", can be found in my GitHub repository). The code below is what I used to obtain all of the fitted LASSO estimates:
# Requires library(dplyr) for select()/starts_with() and
# library(elasticnet) for enet().
# This fits one LASSO regression for each of the corresponding
# 47k datasets stored in the list 'datasets', returning the
# standard fit object produced for any regression run in R.
set.seed(11) # to ensure replicability
LASSO_fits <- lapply(datasets, function(i)
  enet(x = as.matrix(select(i, starts_with("X"))),
       y = i$Y, lambda = 0, normalize = FALSE))
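A quick, purely illustrative sanity check (assuming datasets holds the 47k data frames): each element of LASSO_fits should be one fitted "enet" object per dataset.
length(LASSO_fits)     # should equal length(datasets), i.e. 47,000
class(LASSO_fits[[1]]) # "enet"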
Then, using the code below, I separated out from LASSO_fits just the estimated coefficients for all 30 Independent Variable/factor columns of each dataset and stored them as a list in the object LASSO_Coeffs:
# This extracts and stores the estimated coefficients
# from each of the fitted LASSO regressions
set.seed(11) # to ensure replicability
LASSO_Coeffs <- lapply(LASSO_fits,
  function(i) predict(i, x = as.matrix(select(i, starts_with("X"))),
                      s = 0.1, mode = "fraction",
                      type = "coefficients")[["coefficients"]])
> LASSO_Coeffs[[1]]
X1 X2 X3 X4 X5 X6 X7
0.15516986 0.07733003 0.00000000 0.27838089 0.00000000 0.00000000 0.12361868
X8 X9 X10 X11 X12 X13 X14
0.31700186 0.13254325 0.00000000 0.00000000 0.00000000 0.00000000 0.00000000
X15 X16 X17 X18 X19 X20 X21
0.00000000 0.00000000 0.00000000 0.00000000 0.00000000 0.00000000 0.00000000
X22 X23 X24 X25 X26 X27 X28
0.00000000 0.00000000 0.00000000 0.00000000 0.00000000 0.00000000 0.00000000
X29 X30
0.00000000 0.00000000
The problem with the above output is that, unlike a stepwise regression run in R via the step() function (whose final output lists only the coefficients and names of the factors it selected), the coefficient vector returned for a LASSO fit with enet() includes all 30 factors by default, including those shrunk to zero.
CodePudding user response:
A very simple extension of my previous answer (the last line) gets only the coefficients above zero:
library(dplyr)
library(elasticnet)
# Read every csv under sample_obs2, refit the LASSOs, and pull out the coefficients
dfs <- lapply(list.files("sample_obs2", full.names = TRUE, recursive = TRUE), read.csv)
models <- lapply(dfs, function(i) enet(x = as.matrix(select(i, starts_with("X"))),
                                       y = i$Y, lambda = 0, normalize = FALSE))
coeffs <- lapply(models, function(i) predict(i,
                                             x = as.matrix(select(i, starts_with("X"))),
                                             s = 0.1, mode = "fraction",
                                             type = "coefficients")[["coefficients"]])
# Keep only the coefficients strictly greater than zero
coeffs_above_zero <- lapply(coeffs, function(i) i[i > 0])
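For instance, assuming the first file read from sample_obs2 is the dataset whose coefficients are printed above, the first element would then keep only the six positive entries:
> coeffs_above_zero[[1]]
        X1         X2         X4         X7         X8         X9 
0.15516986 0.07733003 0.27838089 0.12361868 0.31700186 0.13254325 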
Or alternatively to get only the names:
coeffs_above_zero <- lapply(coeffs, function(i) names(i[i > 0]))
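Under the same assumption, each list element is now just a character vector of the selected variable names:
> coeffs_above_zero[[1]]
[1] "X1" "X2" "X4" "X7" "X8" "X9"
Incidentally, the NULL in the original attempt is most likely because [[ drops names: LASSO_Coeffs[[1]][["X1"]] is an unnamed numeric scalar, so calling names() on it returns NULL, whereas subsetting the whole vector with i[i > 0] keeps the names attached.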