I wrote a script based on two for loops that I would like to optimize to speed up its running time.
Below are reproducible data that I simplified with the code that I am using on my own data.
nuc is a vector with 101 "position" and tel is a data frame with different coordinates "aa" and "bb"
The aim is to calculate for each position the number of times each position is comprised between each aa and bb coordinate. For example position 111 is comprise between 3 couple of coordinates : G, I and J
#data
tel=data.frame(aa=c(153,113,163,117,193,162,110,109,186,103),
bb=c(189,176,185,130,200,189,156,123,198,189),
ID=c("A", "B", "C", "D", "E", "F", "G", "H", "I", "J"))
> tel
aa bb ID
1 153 189 A
2 113 176 B
3 163 185 C
4 117 130 D
5 193 200 E
6 162 189 F
7 110 156 G
8 109 123 H
9 186 198 I
10 103 189 J
nuc2=100:200
# Loop
count_occ=0
count_occ_int=NULL
count_occ_fin=NULL
for (j in 1:length(nuc2)){
for (i in 1:nrow(tel)) {
if (nuc2[j]< tel$bb[i] & nuc2[j]>tel$aa[i])
{count_occ=count_occ 1}
}
count_occ_int=count_occ
count_occ_fin=c(count_occ_fin,count_occ_int)
count_occ=0
}
nuc_occ=data.frame(nuc=nuc2, occ=count_occ_fin)
> head(nuc_occ,20)
nuc occ
1 100 0
2 101 0
3 102 0
4 103 0
5 104 1
6 105 1
7 106 1
8 107 1
9 108 1
10 109 1
11 110 2
12 111 3
13 112 3
14 113 3
15 114 4
16 115 4
17 116 4
18 117 4
19 118 5
20 119 5
In my data, the length of my nuc vector is 9304567 and the number of couple of coordinates is 53 (I will have some hundred soon) and it took more than 60 hours to run the code !!
Any idea to help me to speed up this code ?
I though to the apply function but I am not sure how to combine the two for loop operations.
CodePudding user response:
Here's a tidyverse solution:
lapply(
100:200,
\(x) tel %>%
filter(aa <= x & x <= bb) %>%
summarise(occ=n(), .groups="drop") %>%
add_column(nuc=x, .before=1)
) %>%
bind_rows() %>%
as_tibble()
# A tibble: 101 × 2
nuc occ
<int> <int>
1 100 0
2 101 0
3 102 0
4 103 1
5 104 1
6 105 1
7 106 1
8 107 1
9 108 1
10 109 2
# … with 91 more rows
Using microbenchmark
to assess performance, this gives
Unit: nanoseconds
expr min lq mean median uq max neval
lapply 7 9 8.8 9 9 9 10
original 8 9 23.8 9 9 158 10
In other words, a decrease in speed of about two-thirds. And the tidyverse is not known for speed. A base R solution is likely to be faster still.
CodePudding user response:
You can use data.table non-equi join like this:
library(data.table)
setDT(tel)[SJ(v=nuc2), on=.(aa<=v, bb>=v)][,.(occ = sum(!is.na(ID))), by=.(nuc=aa)]
Output:
nuc occ
<int> <int>
1: 100 0
2: 101 0
3: 102 0
4: 103 1
5: 104 1
---
97: 196 2
98: 197 2
99: 198 2
100: 199 1
101: 200 1