Counting the # of correctly selected models by a feature selection ML algorithm in its output struct

The datasets and R scripts referred to in this question can all be found in my GitHub Repository for this project.

The goal is to count how many of the regression models fitted by LASSO are correct. The models were fitted in R, using the enet function from the elasticnet package, on 58k datasets stored as csv files in a single folder. A selected model is correct only if the variables included in the fitted model for a dataset exactly match the true underlying regression model for that dataset. The datasets were generated with a custom Excel macro built in such a way that I know the correct underlying structural model for each one (the details are explained in the p.s. section).

I have exported the variables selected by LASSO (when running the code in the 'LASSO code' script) for each dataset to a csv file called 'IVs_Selected_by_LASSO', then re-imported them into a different R script called 'Quantifying LASSO's performance' (and assigned them to an object called 'BM1_models') after sorting them correctly.

All of the fitted models are stored in the BM1_models object, which looks like the following. The n1-n2-n3-n4 string before each semicolon is the name of a csv file, and what comes after it is the model selected by the LASSO regression run on the dataset in that file:

> BM1_models <- read.csv("IVs_Selected_by_LASSO.csv", header = FALSE)
> head(BM1_models, n = 3)
                    V1
1 0-3-1-1;  X1, X2, X3
2 0-3-1-2;  X1, X2, X3
3 0-3-1-3;  X1, X2, X3

> tail(BM1_models, n = 3)
                                                           V1
57998 1-15-9-498;  X2, X3, X5, X6, X8, X9, X10, X11, X12, X15
57999     1-15-9-499;  X3, X4, X5, X6, X8, X10, X11, X12, X15
58000               1-15-9-500;  X2, X4, X6, X7, X8, X10, X11

> str(BM1_models)
'data.frame':   58000 obs. of  1 variable:
 $ V1: chr  "0-3-1-1;  X1, X2, X3" "0-3-1-2;  X1, X2, X3" "0-3-1-3;  X1, X2, X3" "0-3-1-4;  X1, X2, X3" ...

For the record, there are two spaces after each semicolon, not just one.
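To make that layout concrete, here is a minimal sketch of pulling one line apart with base R regular expressions. The two toy strings are copied from the output above; the names lines, dataset_id and selected are only illustrative.

```r
# Toy lines in the same "n1-n2-n3-n4;  X..." layout as BM1_models$V1
lines <- c(
  "0-3-1-1;  X1, X2, X3",
  "1-15-9-500;  X2, X4, X6, X7, X8, X10, X11"
)

# Everything before the semicolon is the dataset (csv file) name
dataset_id <- sub(";.*$", "", lines)

# Drop the name, the semicolon, and however many spaces follow it;
# what remains is the selected model
selected <- sub("^[^;]*;[[:space:]]*", "", lines)

dataset_id
selected
```

Matching any run of spaces after the semicolon, rather than exactly two, keeps the split working if the spacing ever differs between exports.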

p.s. Whether the ML variable/factor selection method (in this case LASSO) is right for any given dataset is determined by the n2 for that dataset. If n2 says 3, the Independent Variables selected should be X1, X2, X3; if it says 4, the underlying structural model is X1, X2, X3, X4; and so on up until X1, X2, X3, X4, X5, X6, X7, X8, X9, X10, X11, X12, X13, X14, X15 when it says 15. The match must be exact: if the selected model is X1, X3, X4 or X2, X3, X4 when n2 = 3, or any other combination besides X1, X2, X3, it is wrong.
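Stated as code, the rule could look roughly like the sketch below. The helper name true_model is hypothetical, and the sketch assumes the variables are always written in ascending order and separated by a comma and one space, as in the output shown earlier.

```r
# Hypothetical helper: the true model string for a dataset whose n2 is `n2`
true_model <- function(n2) {
  paste0("X", seq_len(n2), collapse = ", ")
}

true_model(3)   # "X1, X2, X3"
true_model(4)   # "X1, X2, X3, X4"

# A selection only counts as correct on an exact match
identical("X1, X2, X3", true_model(3))   # TRUE
identical("X1, X3, X4", true_model(3))   # FALSE
```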

CodePudding user response：

Consider a nested strsplit, then rbind the split vectors into a data frame:

BM1_models <- read.csv("IVs_Selected_by_LASSO.csv", header = FALSE)

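# Split each line at ";" into the dataset name and the selected variables,
# then split the name at "-" to get n1 to n4
n_df <- do.call(
  rbind.data.frame,
  lapply(
    strsplit(BM1_models$V1, ";"),
    function(x) {
      # s[[1]] holds n1, n2, n3, n4; s[[2]] holds the variable list, which
      # contains no "-" and so stays in one piece (leading spaces included)
      s <- strsplit(x, "-")
      c(s[[1]], s[[2]])
    }
  )
) |> setNames(
  c("n1", "n2", "n3", "n4", "IV")
)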

head(n_df)
#   n1 n2 n3 n4           IV
# 1  0  3  1  1   X1, X2, X3
# 2  0  3  1  2   X1, X2, X3
# 3  0  3  1  3   X1, X2, X3
# 4  0  3  1  4   X1, X2, X3
# 5  0  3  1  5   X1, X2, X3
# 6  0  3  1  6   X1, X2, X3

tail(n_df)
#       n1 n2 n3  n4                                                IV
# 57995  1 15  9 495   X2, X3, X4, X5, X7, X9, X10, X11, X12, X13, X15
# 57996  1 15  9 496                     X4, X6, X7, X8, X11, X12, X13
# 57997  1 15  9 497                X2, X3, X4, X9, X10, X11, X13, X14
# 57998  1 15  9 498        X2, X3, X5, X6, X8, X9, X10, X11, X12, X15
# 57999  1 15  9 499            X3, X4, X5, X6, X8, X10, X11, X12, X15
# 58000  1 15  9 500                      X2, X4, X6, X7, X8, X10, X11

Then analyze or subset the result as needed:

# TABULATE n2 COLUMN (n2 IS CHARACTER, SO THE LEVELS SORT AS TEXT)
table(n_df$n2)
#   10   11   12   13   14   15    3    4    5    6    7    8    9 
# 4500 4000 4500 4500 4500 4500 4500 4500 4500 4500 4500 4500 4500 

# SUBSET TO NEEDED CRITERIA
sub_n_df <- subset(n_df, n2 == "3")
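Neither step above counts the correct models yet. One possible way to finish is sketched below on a small toy stand-in for n_df, so the block runs on its own; with the real data, skip the data.frame() call. The expected and correct columns are names introduced here for illustration, and the comparison assumes the variables are always listed in ascending order and separated by a comma and one space.

```r
# Toy stand-in for n_df: same columns, all character, leading spaces kept in IV
n_df <- data.frame(
  n1 = c("0", "0", "1"),
  n2 = c("3", "4", "15"),
  n3 = c("1", "1", "9"),
  n4 = c("1", "2", "500"),
  IV = c("  X1, X2, X3", "  X1, X3, X4", "  X2, X4, X6, X7, X8, X10, X11")
)

# Make n2 an integer so tables and sorting follow 3, 4, ..., 15
n_df$n2 <- as.integer(n_df$n2)

# Expected model per row: X1 through X<n2>, joined the way the csv writes it
n_df$expected <- vapply(
  n_df$n2,
  function(k) paste0("X", seq_len(k), collapse = ", "),
  character(1)
)

# Exact match, after dropping the spaces left over from the ";" split
n_df$correct <- trimws(n_df$IV) == n_df$expected

sum(n_df$correct)                     # number of correctly selected models
tapply(n_df$correct, n_df$n2, mean)   # share of correct models for each n2
```

Comparing whole strings keeps the check strict: a model with an extra, missing, or swapped variable never matches its expected string.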