Replacing a for-loop which iteratively runs Backward Stepwise Regressions on a list of dfs in R with-CodePudding

The R scripts and a reduced version of the file folder with the multiple csv file-formatted datasets in it can all be found in my GitHub Repository found in this Link.

In my script called 'LASSO code', after loading a file folder full of N csv file-formatted datasets into R and assigning them all to a list called 'datasets', I ran the following code to fit N LASSO Regressions, one to each of the datasets:

set.seed(11)     # to ensure replicability
LASSO_fits <- lapply(dfs, function(i) 
               enet(x = as.matrix(select(i, starts_with("X"))), 
                    y = i$Y, lambda = 0, normalize = FALSE))

Now, I would like to replicate this same process for a Backward Elimination Stepwise Regression we'll keep it simple by just using the step() function from the stats library) using another apply function rather than having to use a loop. The problem is this, the only was I know how to do this is by initializing or prepping it before running it by first establishing:

set.seed(100)      # for reproducibility
full_fits <- vector("list", length = length(dfs))
Backward_Stepwise_fits <- vector("list", length = length(dfs))

And only then fitting all of the Backward_Stepwise_fits, but I cannot figure out how to put both full_fits and Backward_Stepwise_fits into the same apply function, the only way I can think of would be to use a for loop and stack them on top of each other inside of it, but that would be very computationally inefficient. And the number of datasets N I will be running both of these on is 260,000!

I wrote a for-loop that does in fact run, but it took over 12 hours to finish running on just 58,500 datasets which is unacceptably slow. The code I used for it is the following:

set.seed(100)      # for reproducibility
for(i in seq_along(dfs)) {
  full_fits[[i]] <- lm(formula = Y ~ ., data = dfs[[i]])
  Backward_Stepwise_fits[[i]] <- step(object = full_fits[[i]], 
                        scope = formula(full_fits[[i]]),
                        direction = 'backward', trace = 0) }

I have tried the following, but get the corresponding error message in the Console:

> full_model_fits <- lapply(datasets, function(i)
    lm(formula = Y ~ ., data = datasets))
Error in terms.formula(formula, data = data) : 
duplicated name 'X1' in data frame using '.'

CodePudding user response：

Ever thought about parallelizing the whole thing?

First, you could define the code more succinctly.

system.time(
  res <- lapply(lst, \(X) {
    full <- lm(Y ~ ., X)
    back <- step(full, scope=formula(full), dir='back', trace=FALSE)
  })
)
#  user  system elapsed 
# 3.895   0.008   3.897 

system.time(
  res1 <- lapply(lst, \(X) step(lm(Y ~ ., X), dir='back', trace=FALSE))
)
#  user  system elapsed 
# 3.820   0.016   3.833 

stopifnot(all.equal(res, res1))

The results are equal, but no time difference.

Now, using parallel::parLapply.

library(parallel)

CL <- makeCluster(detectCores() - 1L)
clusterExport(CL, c('lst'))

system.time(
  res2 <- parLapply(CL, lst, \(X) step(lm(Y ~ ., X), dir='back', trace=FALSE))
)
#  user  system elapsed 
# 0.075   0.032   0.861 

stopCluster(CL)

stopifnot(all.equal(res, res2))

On this machine about 4.5 times faster.

Your error duplicated name 'X1' in data frame using '.' means, that in some of your datasets there are two columns named "X1". Here's how to find them:

names(lst$dat6)[9] <- 'X1'  ## producing duplicated column X1 for demo 

sapply(lst, \(x) anyDuplicated(names(x)))
# dat1  dat2  dat3  dat4  dat5  dat6  dat7  dat8  dat9 dat10 dat11 
# 0     0     0     0     0     9     0     0     0     0     0 
# ...

Result shows, in dataset dat6 the 9th column is the (first) duplicate. All others are clean.

Data:

n <- 50
lst <- replicate(n, {dat <- data.frame(matrix(rnorm(500*30), 500, 30))
cbind(Y=rowSums(as.matrix(dat)%*%rnorm(ncol(dat)))   rnorm(nrow(dat)), dat)}, simplify=FALSE) |> 
  setNames(paste0('dat', seq_len(n)))