I have a problem with the speed of my for-loop.
The main problem is:
I have a data frame with cumulative probabilities (cDistr). There are 86 columns with 100 rows of cumulative probabilities in each column. The task is to set a random number and check whether this number exceeds the probabilities in each row for all columns. This is done with the functions. This part of the problem works fine, however, I need to run this for 40.000 individual agents over 100 periods which is the part I am struggling with. It currently is done in a for-loop and takes too long (~1 hour) and I was wondering if I could get ideas on what I could improve. If more information is needed, please let me know.
See below for a running example:
The function "draw" uses one column of the data frame, does the check and returns the number of TRUE statements.
draw <- function(cDistr){
rv <- runif(1,0,1)
toReturn <- 0
for (i in 1:length(cDistr)){
if (rv > cDistr[i]) {
toReturn <- toReturn 1
}
}
return(toReturn)
}
The function "drawVec" does the same as "draw" but over all the columns in the data frame returning a vector of size 86 (number of true statements over all columns).
drawVec <- function(vec){
toReturn <- vector()
for (i in 1:86){
toReturn <- c(toReturn,draw(vec[,i]))
}
return(toReturn)
}
The code below is mimicking the actually used data so it is just exemplary. The data frame "cDistr" contains cumulative probabilities which are used above for checking.
final <- data.frame(x=0:120)
df1 <- data.frame(x=1:86)
df1$r <- c(1:86)
df1$l <- c(1:86)
for(i in 1:86){
final[,i]<-dpois(0:120,0.0005*df1$r[i]*df1$l[i])
}
cDistr<-cumsum(final)
Below is the main code that is causing the problems. I need to run it for 40000 Agents and save what I obtain from "drawVec" in the list. It takes approx. 1 min on my computer.
Agents <- data.frame(x=1:40000)
Catches_List <- replicate(n = nrow(Agents),
expr = {vector(length = 86, mode = "numeric")},simplify = F)
catch <- function(Agents){
for(i in 1:nrow(Agents)){
rv <<- runif(1,0,1)
catch <- drawVec(cDistr)
Catches_List[[i]] <- catch
}
return(Catches_List)
}
system.time(catch(Agents))
CodePudding user response:
I think you can speed this up greatly by not using the first function, though it would help a LOT to see the first few rows of data.
You have created logic to count TRUE/FALSE
values with a counter, but if you have a column of true false values then you can add that column of values up and get the count of TRUE
values because they equal 1.
R will naturally do this logical function on the whole column/vector passeed, so you can simplify and remove the loops
Accordingly you can pull that loop from the first function:
draw <- function(cDistr){
rv <- runif(1,0,1)
count=sum(cDistr>rv)
return(count)
}
#test
draw(c(2,0,1))
Output: draw(c(2,0,1)) [1] 2
This initial step will relieve you of loops and simplify the counting of TRUE values
Then you can use the apply
family to return a list of values for the entities in each column
df =data.frame(one=c(2,0,1), two=c(5,-2,3)) #test data
lapply(df, draw)
If you want a vector instead of a list:
as.vector(sapply(df, draw))
I do not know your exact data, as you posted none, so you may need to tweak the Draw function a bit, but this is very close. Once that is returning a count, the lapply returns a list of the counts of TRUE per column.
This should be much faster than a loop, but you will still have to wait a bit if there is big, big data.
EDIT:
I just ran on iteration of this on my code using your data here:
#dataframe to pass
final <- data.frame(x=0:120)
df1 <- data.frame(x=1:86)
df1$r <- c(1:86)
df1$l <- c(1:86)
for(i in 1:86){
final[,i]<-dpois(0:120,0.0005*df1$r[i]*df1$l[i])
}
cDistr<-cumsum(final)
# running & testing
system.time(as.vector(sapply(cDistr, draw)))
[1] 121 121 121 121 121 121 121 121 121 121 121 [12] 121 121 121 121 121 121 121 121 121 121 121 [23] 121 120 121 120 121 120 121 120 118 120 121 [34] 120 121 120 121 119 121 120 121 121 121 118 [45] 118 121 120 121 121 120 120 120 119 119 120 [56] 121 118 119 121 121 121 120 121 120 119 118 [67] 120 114 120 118 121 117 119 118 120 119 120 [78] 120 119 120 118 117 118 119 118 119
user system elapsed 0.001 0.000 0.001
Using sapply and native functions this should run in less than a minute for 40,000 iterations
This last bit is just reusing the logic used to draw sapply()
it takes about 14 seconds to run
catch <- function(agent){ #taking single row from the new df
rv <- runif(1,0,1) #generates new random each row
#uses cDistr to draw new vector as in drawVec above
catch <- as.vector(sapply(cDistr, draw))
return(catch)
}
#apply to margin 1, the row, the catch function returning the draw vector and add them all back to Agents one row at a time
system.time(Agents<-apply(Agents, MARGIN = 1, FUN = catch))
This takes less than 30-seconds