I'm trying to merge intersecting ranges of values within each of my groups (n = 147). For example:
my.df <- data.frame(
  chrom  = c('0F','0F','4F','4F','4F','4F'),
  start  = as.numeric(c(1405,1700,1420,2500,19116,20070)),
  stop   = as.numeric(c(1700,2038,2527,3401,20070,20730)),
  strand = c('-','-','-','+','+','+')
)
my.df
  chrom start  stop strand
1    0F  1405  1700      -
2    0F  1700  2038      -
3    4F  1420  2527      -
4    4F  2500  3401      +
5    4F 19116 20070      +
6    4F 20070 20730      +
I want to merge all overlapping ranges within each group, preserving the 'chrom' column and merging ranges only when they are on the same strand. The expected result is:
  chrom start  stop strand
1    0F  1405  2038      -
2    4F  1420  2527      -
3    4F  2500  3401      +
4    4F 19116 20730      +
I've found a few methods for determining the presence of overlaps within each group (e.g., plyranges::count_overlaps), but no way to collapse those intersecting ranges together.
I've tried the method below from a previous question, but it ignores the groupings I require, so the ranges from all of my groups collapse into a single continuous range regardless of whether they actually overlap. I've also tried the answers from this question, but none of them worked.
library(dplyr)

my.df %>%
  arrange(start) %>%
  group_by(g = cumsum(cummax(lag(stop, default = first(stop))) < start)) %>%
  summarise(start = first(start), stop = max(stop))
start end
1 1405 20730
CodePudding user response:
I used the Bioconductor GenomicRanges package, which seems highly appropriate to your domain.
> ## install.packages("BiocManager")
> ## BiocManager::install("GenomicRanges")
> library(GenomicRanges)
> my.df |> as("GRanges") |> reduce()
GRanges object with 5 ranges and 0 metadata columns:
      seqnames      ranges strand
         <Rle>   <IRanges>  <Rle>
  [1]       4F   2500-3401      +
  [2]       4F 19116-20730      +
  [3]       4F   1420-2527      -
  [4]       0F   1405-1700      -
  [5]       0F   1727-2038      -
  -------
  seqinfo: 2 sequences from an unspecified genome; no seqlengths
This differs from your expectation, apparently because the two 0F ranges here (1405-1700 and 1727-2038) do not overlap?
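If you would rather stay in dplyr, here is a minimal sketch of the attempt from the question with the grouping added. It is not the GenomicRanges method above, just the same cumsum/cummax idea computed separately within each chrom and strand combination. It assumes the last three rows are on the '+' strand (any label shared by those rows gives the same result), and `merged` and the helper column `g` are names made up for this example. Ranges that merely touch (stop equal to the next start) are merged, which matches the expected output in the question.

```r
library(dplyr)

# Example data from the question; strand of rows 4-6 is assumed to be '+'.
my.df <- data.frame(
  chrom  = c('0F','0F','4F','4F','4F','4F'),
  start  = c(1405, 1700, 1420, 2500, 19116, 20070),
  stop   = c(1700, 2038, 2527, 3401, 20070, 20730),
  strand = c('-','-','-','+','+','+')
)

merged <- my.df %>%
  arrange(chrom, strand, start) %>%
  # Compute the run id separately for every chrom/strand combination.
  group_by(chrom, strand) %>%
  # A new run starts when a range begins after the largest stop seen so far.
  mutate(g = cumsum(cummax(lag(stop, default = first(stop))) < start)) %>%
  group_by(chrom, strand, g) %>%
  summarise(start = min(start), stop = max(stop), .groups = "drop") %>%
  select(chrom, start, stop, strand) %>%
  arrange(chrom, start)

print(merged)
```

The only real change from the original attempt is that `lag()`, `cummax()` and `cumsum()` now run inside `group_by(chrom, strand)`, so a range on one chromosome or strand can no longer extend a run on another.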