I have a file of about 50M lines, each holding an interval, like this:
>data
start_pos end_pos
1 1 10
2 3 6
3 5 9
4 6 11
I would like a table of position occurrences, i.e. the coverage (the number of intervals that overlap) at each position in the interval file, such as this:
>occurence
position coverage
1 1
2 1
3 2
4 2
5 3
6 4
7 3
8 3
9 3
10 2
11 1
Is there a fast, memory-efficient way to do this in R?
My plan was to loop through the data, concatenate the sequence of positions in each interval into one vector, and convert the final vector into a table:
count<-c()
for (row in 1:nrow(data)){
count<-c(count,(data[row,]$start_pos:data[row,]$end_pos))
}
occurence <- table(count)
The problem is that my file is huge, and this takes far too much time and memory.
CodePudding user response:
The Bioconductor IRanges package does this quickly and with little memory:
library(IRanges)
ir = IRanges(start = c(1, 3, 5, 6), end = c(10, 6, 9, 11))
coverage(ir)
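Converted to a data frame, the result has one coverage value per position (the row number is the position):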
> coverage(ir) |> as.data.frame()
value
1 1
2 1
3 2
4 2
5 3
6 4
7 3
8 3
9 3
10 2
11 1
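If installing a Bioconductor package is not an option, the same counts can probably be obtained in base R with a difference-array approach: add 1 at every interval start, subtract 1 just past every interval end, and take a cumulative sum. This is only a sketch, not what IRanges does internally as far as I know. It assumes all positions are positive integers, and the function name coverage_by_position is made up for illustration.

```r
# Hypothetical helper (not part of any package): coverage at positions 1..max(end).
# Assumes start and end are positive whole numbers with start <= end.
coverage_by_position <- function(start, end) {
  max_pos <- max(end)
  # Count how many intervals open at each position ...
  opens <- tabulate(start, nbins = max_pos + 1L)
  # ... and how many have closed by the position just after their end.
  closes <- tabulate(end + 1L, nbins = max_pos + 1L)
  # The running total of opens minus closes is the coverage.
  cumsum(opens - closes)[seq_len(max_pos)]
}

# Example data from the question
data <- data.frame(start_pos = c(1, 3, 5, 6), end_pos = c(10, 6, 9, 11))

cov <- coverage_by_position(data$start_pos, data$end_pos)
occurence <- data.frame(position = seq_along(cov), coverage = cov)
print(occurence)
```

This never materialises the individual positions of each interval, so memory grows with the largest end position rather than with the total length of all intervals.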