Increasing for-loop speed in R-CodePudding

I have a problem with the speed of my for-loop.

The main problem is:

I have a data frame with cumulative probabilities (cDistr). There are 86 columns with 100 rows of cumulative probabilities in each column. The task is to set a random number and check whether this number exceeds the probabilities in each row for all columns. This is done with the functions. This part of the problem works fine, however, I need to run this for 40.000 individual agents over 100 periods which is the part I am struggling with. It currently is done in a for-loop and takes too long (~1 hour) and I was wondering if I could get ideas on what I could improve. If more information is needed, please let me know.

See below for a running example:

The function "draw" uses one column of the data frame, does the check and returns the number of TRUE statements.

draw <- function(cDistr){
  rv <- runif(1,0,1)
  toReturn <- 0
  for (i in 1:length(cDistr)){
    if (rv > cDistr[i]) {
      toReturn <- toReturn   1
    }
  }
  return(toReturn)
}

The function "drawVec" does the same as "draw" but over all the columns in the data frame returning a vector of size 86 (number of true statements over all columns).

drawVec <- function(vec){
  toReturn <- vector()
  for (i in 1:86){
    toReturn <- c(toReturn,draw(vec[,i]))
  }
  return(toReturn)
}

The code below is mimicking the actually used data so it is just exemplary. The data frame "cDistr" contains cumulative probabilities which are used above for checking.

 final <- data.frame(x=0:120)
 df1 <- data.frame(x=1:86)
 df1$r <- c(1:86)
 df1$l <- c(1:86)
 for(i in 1:86){
   final[,i]<-dpois(0:120,0.0005*df1$r[i]*df1$l[i])
 }
 cDistr<-cumsum(final)

Below is the main code that is causing the problems. I need to run it for 40000 Agents and save what I obtain from "drawVec" in the list. It takes approx. 1 min on my computer.

Agents <- data.frame(x=1:40000)
Catches_List <- replicate(n = nrow(Agents),
                          expr = {vector(length = 86, mode = "numeric")},simplify = F)

catch <- function(Agents){
for(i in 1:nrow(Agents)){
  rv <<- runif(1,0,1)
  catch <- drawVec(cDistr)
  Catches_List[[i]] <- catch
}
  return(Catches_List)
}

system.time(catch(Agents))

CodePudding user response：

I think you can speed this up greatly by not using the first function, though it would help a LOT to see the first few rows of data.

You have created logic to count TRUE/FALSE values with a counter, but if you have a column of true false values then you can add that column of values up and get the count of TRUE values because they equal 1.

R will naturally do this logical function on the whole column/vector passeed, so you can simplify and remove the loops

Accordingly you can pull that loop from the first function:

draw <- function(cDistr){
  rv <- runif(1,0,1)
  count=sum(cDistr>rv)
  return(count)
}

#test
draw(c(2,0,1))

Output: draw(c(2,0,1)) [1] 2

This initial step will relieve you of loops and simplify the counting of TRUE values

Then you can use the apply family to return a list of values for the entities in each column

df =data.frame(one=c(2,0,1), two=c(5,-2,3))  #test data
lapply(df, draw)

If you want a vector instead of a list:

as.vector(sapply(df, draw))

I do not know your exact data, as you posted none, so you may need to tweak the Draw function a bit, but this is very close. Once that is returning a count, the lapply returns a list of the counts of TRUE per column.

This should be much faster than a loop, but you will still have to wait a bit if there is big, big data.

EDIT:

I just ran on iteration of this on my code using your data here:

#dataframe to pass
final <- data.frame(x=0:120)
df1 <- data.frame(x=1:86)
df1$r <- c(1:86)
df1$l <- c(1:86)
for(i in 1:86){
  final[,i]<-dpois(0:120,0.0005*df1$r[i]*df1$l[i])
}
cDistr<-cumsum(final)

# running & testing
system.time(as.vector(sapply(cDistr, draw)))

[1] 121 121 121 121 121 121 121 121 121 121 121 [12] 121 121 121 121 121 121 121 121 121 121 121 [23] 121 120 121 120 121 120 121 120 118 120 121 [34] 120 121 120 121 119 121 120 121 121 121 118 [45] 118 121 120 121 121 120 120 120 119 119 120 [56] 121 118 119 121 121 121 120 121 120 119 118 [67] 120 114 120 118 121 117 119 118 120 119 120 [78] 120 119 120 118 117 118 119 118 119

user system elapsed 0.001 0.000 0.001

Using sapply and native functions this should run in less than a minute for 40,000 iterations

This last bit is just reusing the logic used to draw sapply() it takes about 14 seconds to run

catch <- function(agent){ #taking single row from the new df
  rv <- runif(1,0,1) #generates new random each row
   #uses cDistr to draw new vector as in drawVec above
  catch <- as.vector(sapply(cDistr, draw))
  return(catch)
}
 #apply to margin 1, the row, the catch function returning the draw vector and add them all back to Agents one row at a time
system.time(Agents<-apply(Agents, MARGIN = 1, FUN = catch))

This takes less than 30-seconds