My goal is to fit a LASSO regression, using the enet() function from the elasticnet package in R, to each of 47,000 datasets stored as individual csv files, all located in the same large folder called "sample_obs". Each csv file's name is formatted as #-#-#-#; for example, the first 3 are called "0.4-3-1-1", "0.4-3-1-2", "0.4-3-1-3".
Once a model has been fitted to each dataset and the output stored in a list with 47k elements, all I have left to do is pull out just the factors (aka predictors or independent variables) selected by LASSO j for dataset j and store them in another list. So, for each list element, my final desired output should look like one of the following: either the variable names, i.e. X#, X#, X#, X#, etc. (anywhere from 1 to 30 of them, because each dataset has 30 candidate predictors), or their positions, for example 1, 2, 5, 6, 9, 26, 29.
Towards completing this task, I decided to start by figuring out how to do all of this on a single csv file that has been loaded into R by itself and assigned to its own object. To do this, I used a csv file from a much smaller dataset folder called "sample_obs2" in order to GREATLY reduce the runtime required! You can find the sample_obs2 dataset on my GitHub account in the "Estimated-Exhaustive_Regression-Project" repository.
Here is the code I wrote and the output I got for that simpler version:
setwd("~/DAEN_698/other datasets/sample_obs2")
> getwd()
[1] "C:/Users/Spencer/Documents/DAEN_698/other datasets/sample_obs2"
# read the data in from the first csv file in the file folder
dataset_1 <- read.csv("0-5-1-1.csv")
head(dataset_1, n = 1)
Y X1 X2 X3 X4 X5 X6 X7 X8 X9
1 5.70511 1.339406 1.033558 0.4749296 0.3720555 0.928961 0.3804003 -0.4386075 0.786346 -0.6860546
X10 X11 X12 X13 X14 X15 X16 X17 X18
1 -0.8863821 -0.9128645 -0.08443444 -0.2918255 1.527747 -0.8496993 0.9825339 0.8999604 -1.047078
X19 X20 X21 X22 X23 X24 X25 X26 X27
1 0.07337369 -1.429877 -0.1062012 -0.6954525 1.025954 0.7472764 -0.02252112 0.0932389 1.173201
X28 X29 X30
1 2.061864 -1.129998 0.1931626
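library(elasticnet)  # provides enet()
set.seed(50)
# lambda = 0 drops the quadratic penalty, so enet() fits the LASSO path
LASSO2_fit1 <- enet(x = as.matrix(dataset_1[2:31]),
                    y = dataset_1$Y, lambda = 0, normalize = FALSE)
# type = "coefficients" needs no data argument, so none is passed
LASSO_coeffs1 <- predict(LASSO2_fit1,
                         s = 0.1, mode = "fraction", type = "coefficients")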
LASSO_coeffs1[["coefficients"]]
X1 X2 X3 X4 X5 X6 X7 X8 X9
0.20039732 0.13671726 0.12411170 0.06292652 0.07892046 0.00000000 0.00000000 0.00000000 0.00000000
X10 X11 X12 X13 X14 X15 X16 X17 X18
0.00000000 0.00000000 0.00000000 0.00000000 0.00000000 0.00000000 0.00000000 0.00000000 0.00000000
X19 X20 X21 X22 X23 X24 X25 X26 X27
0.00000000 0.00000000 0.00000000 0.00000000 0.00000000 0.00000000 0.00000000 0.00000000 0.00000000
X28 X29 X30
0.00000000 0.00000000 0.00000000
This is fairly close to the final format of the output I am looking for. I am sure something like a for loop with an if statement inside it can get me the rest of the way there once I know how to repeat the above process for all 47k datasets! But my problem is that I have tried and failed to repeat that process iteratively over all 47k of my datasets.
CodePudding user response:
All we have to do is lapply these functions to the list of data frames that you have. Each csv then yields a single vector of coefficients, as expected.
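library(dplyr)       # select(), starts_with()
library(elasticnet)  # enet()
# read every csv under sample_obs2 into a list of data frames
dfs <- lapply(list.files("sample_obs2", full.names = TRUE, recursive = TRUE), read.csv)
# fit one LASSO (lambda = 0) per data frame
models <- lapply(dfs, function(i) enet(x = as.matrix(select(i, starts_with("X"))),
                                       y = i$Y, lambda = 0, normalize = FALSE))
# here i is a fitted model, not a data frame; type = "coefficients" needs no data argument
coeffs <- lapply(models, function(i) predict(i,
                                s = 0.1, mode = "fraction", type = "coefficients")[["coefficients"]])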
coeffs[[1]]
# X1 X2 X3 X4 X5 X6 X7 X8 X9 X10 X11 X12 X13 X14 X15 X16 X17 X18 X19 X20
# 0.20039732 0.13671726 0.12411170 0.06292652 0.07892046 0.00000000 0.00000000 0.00000000 0.00000000 0.00000000 0.00000000 0.00000000 0.00000000 0.00000000 0.00000000 0.00000000 0.00000000 0.00000000 0.00000000 0.00000000
# X21 X22 X23 X24 X25 X26 X27 X28 X29 X30
# 0.00000000 0.00000000 0.00000000 0.00000000 0.00000000 0.00000000 0.00000000 0.00000000 0.00000000 0.00000000
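To get from these coefficient vectors to the list of selected predictors the question asks for, one option is to keep only the entries that are not zero. The following is a minimal sketch using base R only; the two short vectors in `coeffs` are made-up stand-ins for the real list (they are not actual fits), and `selected_names` and `selected_index` are names chosen here for illustration.

```r
# Stand-in for the coeffs list: one named numeric vector per dataset.
# These two vectors are invented for illustration, not real LASSO output.
coeffs <- list(
  c(X1 = 0.20, X2 = 0.14, X3 = 0.00, X4 = 0.06),
  c(X1 = 0.00, X2 = 0.00, X3 = 0.31, X4 = 0.00)
)

# Names of the predictors with a non-zero coefficient in each fit
selected_names <- lapply(coeffs, function(b) names(b)[b != 0])

# The same selection expressed as column positions
selected_index <- lapply(coeffs, function(b) unname(which(b != 0)))

selected_names[[1]]
# [1] "X1" "X2" "X4"
selected_index[[2]]
# [1] 3
```

Both results are ordinary lists with one element per dataset, so no explicit for loop or if statement is needed. If the list elements should be traceable back to their csv files, the file names returned by list.files() could probably be attached with names() on the list, though that step is not shown here.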