Merge overlapping ranges per group

Time:03-10

I'm trying to merge intersecting ranges of values within each of my groups (n = 147). For example:

my.df <- data.frame(chrom = c('0F','0F','4F','4F','4F','4F'),
                    start = c(1405, 1700, 1420, 2500, 19116, 20070),
                    stop = c(1700, 2038, 2527, 3401, 20070, 20730),
                    strand = c('-','-','-','+','+','+'))
my.df

  chrom start  stop strand
1    0F  1405  1700      -
2    0F  1700  2038      -
3    4F  1420  2527      -
4    4F  2500  3401      +
5    4F 19116 20070      +
6    4F 20070 20730      +

I want to merge all of the overlapping ranges within each group while preserving the 'chrom' column, and only merge ranges that are on the same strand:

  chrom start  stop strand
1    0F  1405  2038      -
2    4F  1420  2527      -
3    4F  2500  3401      +
4    4F 19116 20730      +

I've found a few methods for determining the presence of overlaps within each group (e.g., plyranges::count_overlaps), but no way to collapse those intersecting ranges together.

I've tried the method below from a previous question, but it ignores the groupings I require: the ranges from all of my groups get merged into a single, continuous range, regardless of whether they actually overlap. I've also tried the answers from this question, but none of them worked.

library(dplyr)

my.df %>% 
       arrange(start) %>% 
       group_by(g = cumsum(cummax(lag(stop, default = first(stop))) < start)) %>% 
       summarise(start = first(start), stop = max(stop))

     start      end
1     1405    20730 

CodePudding user response:

I used the Bioconductor GenomicRanges package, which seems highly appropriate to your domain.


> ## install.packages("BiocManager")
> ## BiocManager::install("GenomicRanges")
> library(GenomicRanges)
> my.df |> as("GRanges") |> reduce()
GRanges object with 5 ranges and 0 metadata columns:
      seqnames      ranges strand
         <Rle>   <IRanges>  <Rle>
  [1]       4F   2500-3401      +
  [2]       4F 19116-20730      +
  [3]       4F   1420-2527      -
  [4]       0F   1405-1700      -
  [5]       0F   1727-2038      -
  -------
  seqinfo: 2 sequences from an unspecified genome; no seqlengths

This differs from your expectation, perhaps because the two 0F ranges (1405-1700 and 1727-2038) do not overlap?
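
If you would rather stay in dplyr, the attempt from the question can probably be adapted by grouping on chrom and strand before tracking the running maximum of stop. This is only a minimal sketch, not a tested solution for the full 147-group data. It assumes the blank strand values in the example are '+', and that ranges sharing an endpoint (e.g. 1405-1700 and 1700-2038) should be merged. The names `cluster` and `merged` are introduced here purely for illustration.

```r
library(dplyr)

# Example data from the question (strand '+' for the last three rows is an assumption)
my.df <- data.frame(
  chrom  = c("0F", "0F", "4F", "4F", "4F", "4F"),
  start  = c(1405, 1700, 1420, 2500, 19116, 20070),
  stop   = c(1700, 2038, 2527, 3401, 20070, 20730),
  strand = c("-", "-", "-", "+", "+", "+")
)

merged <- my.df %>%
  group_by(chrom, strand) %>%
  arrange(start, .by_group = TRUE) %>%
  # Start a new cluster whenever a range begins after the largest stop
  # seen so far among the earlier ranges of the same chrom/strand group.
  mutate(cluster = cumsum(start > lag(cummax(stop), default = first(start)))) %>%
  group_by(chrom, strand, cluster) %>%
  summarise(start = min(start), stop = max(stop), .groups = "drop") %>%
  select(chrom, start, stop, strand)

merged
```

On the example data this gives the four ranges from the expected output, although the row order may differ. Changing `>` to `>=` in the `mutate()` step would keep ranges that merely touch as separate rows.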
