decrease the time of script running (for loop based)-CodePudding

I wrote a script based on two for loops that I would like to optimize to speed up its running time.

Below are reproducible data that I simplified with the code that I am using on my own data.

nuc is a vector with 101 "position" and tel is a data frame with different coordinates "aa" and "bb"

The aim is to calculate for each position the number of times each position is comprised between each aa and bb coordinate. For example position 111 is comprise between 3 couple of coordinates : G, I and J

#data

tel=data.frame(aa=c(153,113,163,117,193,162,110,109,186,103),
               bb=c(189,176,185,130,200,189,156,123,198,189),
               ID=c("A", "B", "C", "D", "E", "F", "G", "H", "I", "J"))

> tel
    aa  bb ID
1  153 189  A
2  113 176  B
3  163 185  C
4  117 130  D
5  193 200  E
6  162 189  F
7  110 156  G
8  109 123  H
9  186 198  I
10 103 189  J

nuc2=100:200

# Loop

count_occ=0
count_occ_int=NULL
count_occ_fin=NULL

for (j in 1:length(nuc2)){
  for (i in 1:nrow(tel)) {
    
    if (nuc2[j]< tel$bb[i] & nuc2[j]>tel$aa[i]) 
    {count_occ=count_occ 1}
  }
  count_occ_int=count_occ
  count_occ_fin=c(count_occ_fin,count_occ_int)
  count_occ=0
}

nuc_occ=data.frame(nuc=nuc2, occ=count_occ_fin)

> head(nuc_occ,20)
   nuc occ
1  100   0
2  101   0
3  102   0
4  103   0
5  104   1
6  105   1
7  106   1
8  107   1
9  108   1
10 109   1
11 110   2
12 111   3
13 112   3
14 113   3
15 114   4
16 115   4
17 116   4
18 117   4
19 118   5
20 119   5

In my data, the length of my nuc vector is 9304567 and the number of couple of coordinates is 53 (I will have some hundred soon) and it took more than 60 hours to run the code !!

Any idea to help me to speed up this code ?

I though to the apply function but I am not sure how to combine the two for loop operations.

CodePudding user response：

Here's a tidyverse solution:

lapply(
  100:200,
  \(x) tel %>% 
         filter(aa <= x & x <= bb) %>% 
         summarise(occ=n(), .groups="drop") %>% 
         add_column(nuc=x, .before=1)
) %>% 
bind_rows() %>% 
as_tibble()
# A tibble: 101 × 2
     nuc   occ
   <int> <int>
 1   100     0
 2   101     0
 3   102     0
 4   103     1
 5   104     1
 6   105     1
 7   106     1
 8   107     1
 9   108     1
10   109     2
# … with 91 more rows

Using microbenchmark to assess performance, this gives

Unit: nanoseconds
     expr min lq mean median uq max neval
   lapply   7  9  8.8      9  9   9    10
 original   8  9 23.8      9  9 158    10

In other words, a decrease in speed of about two-thirds. And the tidyverse is not known for speed. A base R solution is likely to be faster still.

CodePudding user response：

You can use data.table non-equi join like this:

library(data.table)
setDT(tel)[SJ(v=nuc2), on=.(aa<=v, bb>=v)][,.(occ = sum(!is.na(ID))), by=.(nuc=aa)]

Output:

       nuc   occ
     <int> <int>
  1:   100     0
  2:   101     0
  3:   102     0
  4:   103     1
  5:   104     1
 ---            
 97:   196     2
 98:   197     2
 99:   198     2
100:   199     1
101:   200     1