How can I make my for loop in R run faster? Can I vectorize this?

#Start: Initialize values

#For each block length (BlockLengths) I run 10 estimates (ThetaL). For each estimate, I simulate 50000 observations (Obs). Each estimate is calculated on the basis of the block length.

Index=0                                        #Initializing Index. 
ThetaL=10                                      #Number of estimations of Theta. 
Obs=50000                                      #Sample size.
Grp=vector(length=7)                           #Initializing a vector of the number of blocks. It depends on the block lengths (set to Obs/r inside the loop)
Theta=matrix(data=0,nrow=ThetaL,ncol=7)        #Initializing a matrix of the estimates of Thetas. There are 10 for each block length.
BlockLengths<-c(10,25,50,100,125,200,250)      #Setting the block lengths

for (r in BlockLengths){                       
  Index=Index+1
  Grp[Index]=Obs/r                             
  
  for (k in 1:ThetaL){                         
    
#Start: Constructing the sample
    Y1<-matrix(data=0,nrow=Obs,ncol=2)      
    Y1[1,]<-runif(2,0,1)
    Y1[1,1]<- -log(-(Y1[1,1])^2+1)
    Y1[1,2]<- -log(-(Y1[1,2])^2+1)
    
    for (i in 2:Obs)
    {
      Y1[i,1]<-Y1[i-1,2]
      Y1[i,2]<-runif(1,0,1)
      Y1[i,2]<- -log(-(Y1[i,2])^2+1)
    }
    
    X1 <- vector(length=Obs)
    for (i in 1:Obs){
      X1[i]<-max(Y1[i,])
    }
#End: Constructing the sample
    
    K=0                                         #K counts the number of blocks with at least one exceedance
    for (t in 1:Grp[Index]){                    #For loop from 1 to number of groups
      a=0
      for (j in (1+r*(t-1)):(t*r)){             #Loop for the sample within each group
        
        if (X1[j]>quantile(X1,0.99)){           #If a value exceeds the high threshold, we add 1 to the counter a
          a=a+1
        }
        
      }
      
      if(a>=1){                                 #For the group, if a is at least 1, we have had an exceedance.
        K=K+1                                   #Counts number of blocks with at least one exceedance.
      }
      
    }
    
    
    N<-sum(X1>=quantile(X1,0.99))               #Summing number of exceedances 
    
    
    
    Theta[k,Index]<- (1/r) * ((log(1-K/Grp[Index])) / (log(1-N/Obs)))  #Estimate
    #Theta[k,Index]<-K/N
  }
}

The code above runs without errors and takes about 20 minutes, but I want to run it for larger samples and more repetitions, which makes the run time absurdly long. I tried to keep only the necessary parts inside the loops to optimize it a little. Is it possible to optimize it further, or should I use another programming language, since I've read that R is slow with for loops? Will vectorization help? If so, how can I vectorize the code?

Answer:

First, you can define BlockLengths before Grp and Theta, as both of them depend on its length:

Index = 0
ThetaL = 2
Obs = 10000
BlockLengths = c(10,25)
Grp = vector(length = length(BlockLengths))
Theta = matrix(data = 0, nrow = ThetaL, ncol = length(BlockLengths))

Note: I decreased the size of the problem so that I could run it faster. With this specification, your original loop took 24.5 seconds.

Now, for the computation itself, there were three points where I could improve it:

Creation of Y1: the second column can be generated at once, just by creating Obs random numbers with runif(). Then, the first column can be created as a lag of the second column. With only this alteration, the loop ran in 21.5 seconds (12% improvement).
Creation of X1: you can compute the row-wise maximum with apply instead of an explicit loop. This alteration saved a further 1.5 seconds (6% improvement).
Calculation of K: you can, for each t, get all the values of X1[(1+r*(t-1)):(t*r)] and run the condition on all of them at once (instead of using the second loop). The any(...) call does the same as your a>=1. Furthermore, you can replace the first loop with lapply, then sum the resulting boolean vector, yielding the same result as your combination of if(a>=1) and K=K+1. The pipes (|>) are only there to make the order of operations easier to read. This is by far the most important alteration, saving a further 18.4 seconds (75% improvement).
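As a small illustration of the first two points, here is a minimal sketch on a reduced sample. The names first_col and second_col are only illustrative, and pmax() is swapped in as an alternative to apply(): it is not what the answer's loop uses, but it computes the same element-wise maximum in a single vectorized call. The transform -log(1 - U^2) is the same as -log(-(U)^2 + 1) from the question.

```r
set.seed(1)
Obs <- 1000                                   # reduced sample size, for illustration only

# Second column: all Obs transformed uniforms drawn in one call
second_col <- -log(1 - runif(Obs)^2)
# First column: one fresh draw, then the second column lagged by one row
first_col <- c(-log(1 - runif(1)^2), second_col[-Obs])

# Row-wise maximum, once with apply() and once with pmax()
X1_apply <- apply(cbind(first_col, second_col), 1, max)
X1_pmax  <- pmax(first_col, second_col)

# Both approaches should agree element by element
stopifnot(all.equal(X1_apply, X1_pmax))
```

On large inputs pmax() is usually the faster of the two, because apply() still calls max() once per row.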

for (r in BlockLengths){                       
  Index = Index + 1
  Grp[Index] = Obs/r                             
  
  for (k in 1:ThetaL){                         
    Y1 <- matrix(data = 0, nrow = Obs, ncol = 2)
    Y1[,2] <- -log(-(runif(Obs))^2 + 1)
    Y1[,1] <- c(-log(-(runif(1))^2 + 1), Y1[-Obs,2])
    
    X1 <- apply(Y1, 1, max)
    
    K <- lapply(1:Grp[Index], function(t){any(X1[(1+r*(t-1)):(t*r)] > quantile(X1,0.99))}) |> unlist() |> sum()
    N <- sum(X1 >= quantile(X1, 0.99))
    Theta[k,Index] <- (1/r) * ((log(1-K/Grp[Index])) / (log(1-N/Obs)))
  }
}

Using set.seed() I got the same results as your original loop.
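One further step that might help, offered as a sketch rather than something I have timed on your full problem: quantile(X1, 0.99) does not depend on t, so it can be computed once per sample instead of once per block. The block-wise any() can then be replaced by reshaping the exceedance indicator into a matrix with r rows, so that each column is one block. This assumes Obs is an exact multiple of r, which holds for the block lengths used here. The names thr, exceed and blocks are only illustrative.

```r
set.seed(1)
Obs <- 10000
r   <- 25                                     # one block length, for illustration

second_col <- -log(1 - runif(Obs)^2)
first_col  <- c(-log(1 - runif(1)^2), second_col[-Obs])
X1 <- pmax(first_col, second_col)

thr    <- quantile(X1, 0.99)                  # threshold computed once per sample
exceed <- X1 > thr                            # logical vector of exceedances

# matrix() fills column by column, so column t holds observations
# (1 + r*(t-1)):(t*r), i.e. exactly block t (requires Obs %% r == 0)
blocks   <- matrix(exceed, nrow = r)
K_matrix <- sum(colSums(blocks) > 0)          # blocks with at least one exceedance

# Reference: the lapply/any version, using the same precomputed threshold
K_lapply <- sum(unlist(lapply(1:(Obs / r), function(t) {
  any(X1[(1 + r * (t - 1)):(t * r)] > thr)
})))

stopifnot(K_matrix == K_lapply)
```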

A possible further improvement is to replace the r and k loops with the purrr::map function.
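A minimal base-R sketch of that idea, using sapply() and replicate() in place of the two outer loops so that no extra package is needed (purrr's mapping functions could be used in much the same way). The helper name estimate_theta is hypothetical, and its body uses pmax() and a matrix reshape, which are assumptions of this sketch rather than the code shown above. I would expect this mainly to tidy the code; most of the speed-up comes from vectorizing the inner steps.

```r
# estimate_theta() is a hypothetical helper: one simulated sample, one estimate.
# It assumes Obs is an exact multiple of r.
estimate_theta <- function(r, Obs) {
  second_col <- -log(1 - runif(Obs)^2)
  first_col  <- c(-log(1 - runif(1)^2), second_col[-Obs])
  X1  <- pmax(first_col, second_col)
  thr <- quantile(X1, 0.99)                   # threshold computed once per sample
  Grp <- Obs / r                              # number of blocks
  K <- sum(colSums(matrix(X1 > thr, nrow = r)) > 0)  # blocks with an exceedance
  N <- sum(X1 >= thr)                         # total number of exceedances
  (1 / r) * log(1 - K / Grp) / log(1 - N / Obs)
}

set.seed(1)
ThetaL       <- 2
Obs          <- 10000
BlockLengths <- c(10, 25)

# Rows are the ThetaL repetitions, columns are the block lengths
Theta <- sapply(BlockLengths, function(r) {
  replicate(ThetaL, estimate_theta(r, Obs))
})
```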