Data frame creation inside Parlapply in R

Time:12-09

I am trying to do something fairly simple: run a bunch of regressions in parallel. When I use the following data generator (PART 1), the parallel part does not work and gives the error listed below.

#PART 1
p <- 20; rho<-0.7;
cdc<- diag(p)
for( i in 1:(p-1) ){ for( j in (i+1):p ){
  cdc[i,j] <- cdc[j,i] <- rho^abs(i-j)
}}
my.data <- MASS::mvrnorm(n=100, mu = rep(0, p), Sigma = cdc)  # mvrnorm() comes from the MASS package
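
As a side note on PART 1, the double loop fills a matrix whose (i, j) entry is rho^|i - j|. A minimal sketch of the same construction without loops, using only base R (the names cdc_loop, cdc_vec and same are illustrative, not from the original code):

```r
p <- 20
rho <- 0.7

# Loop version, as in PART 1
cdc_loop <- diag(p)
for (i in 1:(p - 1)) {
  for (j in (i + 1):p) {
    cdc_loop[i, j] <- cdc_loop[j, i] <- rho^abs(i - j)
  }
}

# Vectorised version: outer() builds the matrix of index differences i - j
idx <- seq_len(p)
cdc_vec <- rho^abs(outer(idx, idx, "-"))

# Both constructions should agree up to floating-point tolerance
same <- isTRUE(all.equal(cdc_loop, cdc_vec))
```

Either matrix can be passed as Sigma to mvrnorm; the vectorised form is just shorter and avoids the index arithmetic.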

The parallel part below does work, however, if I generate the data as in PART 2:

# PART 2
my.data<-matrix(rnorm(1000,0,1),nrow=100,ncol=10)

I set up the function that I want to run in parallel as follows:

parallel_fun<-function(obj,my.data){
  p1 <- nrow(cov(my.data));store.beta<-matrix(0,p1,length(obj))
  count<-1
  for (itration in obj) {
    my_df<-data.frame(my.data)
    colnames(my_df)[itration] <- "y"
    my.model<-bas.lm(y ~ ., data= my_df, alpha=3,
                     prior="ZS-null", force.heredity = FALSE, pivot = TRUE)
    cf<-coef(my.model, estimator="MPM") 
    betas<-cf$postmean[-1]
    store.beta[ -itration, count]<- betas
    count<-count+1
  }
  result<-list('Beta'=store.beta)
}

I then run parLapply as follows:


{
  no_cores <- detectCores(logical = TRUE)  
  myclusternumber<-(no_cores-1)
  cl <- makeCluster(myclusternumber)  
  registerDoParallel(cl)
  p1 <- ncol(my.data)
  obj<-splitIndices(p1, myclusternumber) 
  clusterExport(cl,list('parallel_fun','my.data','obj'),envir=environment())
   clusterEvalQ(cl, {
    library(MASS)
    library(Matrix)
    library(BAS)
  })
  newresult<-parallel::parLapply(cl,obj,fun = parallel_fun,my.data)
  stopCluster(cl)
  
}

But whenever I use PART 1, I get the following error:

Error in checkForRemoteErrors(val) : 7 nodes produced errors; first error: object 'my_df' not found

This should not happen: the data frame should have been created. I have no idea why this is happening. Any help is appreciated.

CodePudding user response:

Posting this as one possible workaround; see if it works:

parallel_fun<-function(obj,my.data){
  p1 <- nrow(cov(my.data));store.beta<-matrix(0,p1,length(obj))
  count<-1
  for (itration in obj) {
    my_df<-data.frame(my.data)
    colnames(my_df)[itration] <- "y"
    my_df <<- my_df
    my.model<-bas.lm(y ~ ., data= my_df, alpha=3,
                     prior="ZS-null", force.heredity = FALSE, pivot = TRUE)
    cf<-BAS:::coef.bas(my.model, estimator="MPM") 
    betas<-cf$postmean[-1]
    store.beta[ -itration, count]<- betas
    count<-count+1
  }
  result<-list('Beta'=store.beta)
}

The issue seems to be with the BAS:::coef.bas function, which calls eval to retrieve my_df and fails to find it when called in parallel. The "hack" here is to push my_df out of the function into an enclosing environment (in practice the worker's global environment) by calling my_df <<- my_df.
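
To illustrate the kind of failure described above, here is a toy sketch in base R. It is not the BAS code: toy_fit, toy_coef, worker and toy_df are hypothetical names, and the assumption is simply that a later method re-evaluates the stored data argument in the global environment rather than in the function where the data frame was created.

```r
# Hypothetical stand-in for a fitting function: it only records its own call
toy_fit <- function(data) {
  structure(list(call = match.call()), class = "toy_fit")
}

# Hypothetical stand-in for a later method that looks the data up again,
# here (by assumption) in the global environment
toy_coef <- function(fit) {
  eval(fit$call$data, envir = globalenv())
}

worker <- function(use_hack) {
  toy_df <- data.frame(y = 1:3)      # exists only inside this function call
  if (use_hack) toy_df <<- toy_df    # copy it out to the global environment
  fit <- toy_fit(data = toy_df)
  # Return the row count on success, or the error message on failure
  tryCatch(nrow(toy_coef(fit)), error = function(e) conditionMessage(e))
}

without_hack <- worker(FALSE)  # an error message: toy_df is not visible globally
with_hack    <- worker(TRUE)   # the lookup succeeds because a global copy exists

# Clean up the global copy created by the hack
rm("toy_df", envir = globalenv())
```

In this sketch the first call returns an error message as a string and the second returns the number of rows, which mirrors why my_df <<- my_df makes the lookup succeed on the workers.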

There should be a better way to do this, but <<- might be the quickest. In general, <<- may cause unwanted behaviour, especially when used in loops. Assigning the object to a unique variable name before exporting it (and not forgetting to remove it after use) is one way to tackle this.
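
One possible shape for the "unique name, then clean up" idea is sketched below. The helper with_global_copy and the tmp_df_ prefix are made up for illustration, and stats::lm is used only as a stand-in model function; whether the same pattern suits bas.lm is an assumption that would need checking.

```r
# Hypothetical helper: store `value` in the global environment under a
# fresh name, run `body` with that name as a symbol, then remove the copy
with_global_copy <- function(value, body, prefix = "tmp_df_") {
  # Pick a name that does not yet exist in the global environment
  repeat {
    nm <- paste0(prefix, sample.int(1e9, 1))
    if (!exists(nm, envir = globalenv(), inherits = FALSE)) break
  }
  assign(nm, value, envir = globalenv())
  # Remove the global copy when this function exits, even after an error
  on.exit(rm(list = nm, envir = globalenv()), add = TRUE)
  body(as.name(nm))
}

res <- with_global_copy(data.frame(y = c(1, 2, 3)), body = function(sym) {
  # bquote() splices the unique symbol into the call, so the fitted object
  # records that name as its data argument
  fit <- eval(bquote(stats::lm(y ~ 1, data = .(sym))))
  list(coef = unname(coef(fit)), data_arg = as.character(fit$call$data))
})

# No temporary objects should remain in the global environment
leftover <- ls(envir = globalenv(), pattern = "^tmp_df_")
```

Compared with a bare <<-, this keeps the global copy from clashing with other objects and guarantees it is removed once the work is done.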
