Gather overlapping coordinates columns within same groups in R

Time:03-29

I have a dataframe such as

    Seq Chrm  start  end  length  score
0     A   C1      1   50      49     12
1     B   C1      3   55      52     12
2     C   C1      6   60      54     12
3  Cbis   C1      6   60      54     11
4     D   C1     70  120      50     12
5     E   C1     78  111      33     12
6     F   C2    350  400      50     12
7     A   C2    349  400      51     12
8     B   C2    450  500      50     12

Within each Chrm, I would like to group the rows whose start-end coordinates overlap and, in each group, keep only the row with the largest length value; ties on length are broken by the highest score value.

For example in C1:

Seq    Chrm start end  length score
A      C1   1     50   49     12
B      C1   3     55   52     12
C      C1   6     60   54     12
Cbis   C1   6     60   54     11
D      C1   70    120  50     12
E      C1   78    111  33     12
 

The start-end coordinates of A, B, C and Cbis overlap one another, and those of D and E overlap each other.

In the A, B, C, Cbis group the longest are C and Cbis with 54, so I keep the one with the highest score, which is **C** (12). In the **D, E** group, the longest is **D** with 50. So I keep only rows C and D here.

If I do the same for other Chrm I should then get the following output:

Seq Chrm start end  length score
C   C1   6     60   54 12
D   C1   70    120  50 12
A   C2   349   400  51 12
B   C2   450   500  50 12

Here is the dataframe in dput format, in case it helps:

structure(list(Seq = c("A", "B", "C", "Cbis", "D", "E", "F", 
"A", "B"), Chrm = c("C1", "C1", "C1", "C1", "C1", "C1", "C2", 
"C2", "C2"), start = c(1L, 3L, 6L, 6L, 70L, 78L, 350L, 349L, 
450L), end = c(50L, 55L, 60L, 60L, 120L, 111L, 400L, 400L, 500L
), length = c(49L, 52L, 54L, 54L, 50L, 33L, 50L, 51L, 50L), score = c(12L, 
12L, 12L, 11L, 12L, 12L, 12L, 12L, 12L)), class = "data.frame", row.names = c(NA, 
-9L))

CodePudding user response:

Using tidyverse functions:

library(tidyverse)

dat %>% 
  group_by(Chrm) %>% 
  arrange(start, end) %>% 
  group_by(cum = head(c(0, cumsum((end < lead(start)) | (end > lead(start) & start > lead(end)))), -1)) %>%
  arrange(desc(length), desc(score)) %>% 
  slice_head(n = 1)

  Seq   Chrm  start   end length score   cum
  <chr> <chr> <int> <int>  <int> <int> <dbl>
1 C     C1        6    60     54    12     0
2 D     C1       70   120     50    12     1
3 A     C2      349   400     51    12     2
4 B     C2      450   500     50    12     3
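
The pipeline above compares each row only with the next one, and the `cum` column in the output suggests the groups are numbered across the whole table rather than per `Chrm`. That is enough for this sample. As a hedged alternative, here is a minimal base R sketch that tracks the running maximum `end` and restarts the grouping for every `Chrm`. The helper `overlap_cluster` and the `cluster` column are names made up for this illustration, and ranges that merely touch (a start equal to an earlier end) are assumed to overlap.

```r
# Example data from the question
dat <- data.frame(
  Seq    = c("A", "B", "C", "Cbis", "D", "E", "F", "A", "B"),
  Chrm   = c("C1", "C1", "C1", "C1", "C1", "C1", "C2", "C2", "C2"),
  start  = c(1L, 3L, 6L, 6L, 70L, 78L, 350L, 349L, 450L),
  end    = c(50L, 55L, 60L, 60L, 120L, 111L, 400L, 400L, 500L),
  length = c(49L, 52L, 54L, 54L, 50L, 33L, 50L, 51L, 50L),
  score  = c(12L, 12L, 12L, 11L, 12L, 12L, 12L, 12L, 12L)
)

# Hypothetical helper: label overlap clusters within one chromosome.
# Input must be sorted by start. A new cluster begins when a start lies
# beyond the largest end seen so far (touching ranges count as overlapping).
overlap_cluster <- function(start, end) {
  prev_max_end <- c(-Inf, head(cummax(end), -1))
  cumsum(start > prev_max_end)
}

# Sort by chromosome and start, then label clusters separately per chromosome
dat <- dat[order(dat$Chrm, dat$start, dat$end), ]
dat$cluster <- ave(seq_len(nrow(dat)), dat$Chrm, FUN = function(i) {
  overlap_cluster(dat$start[i], dat$end[i])
})

# Within each (Chrm, cluster), put the longest, then highest-scoring, row first
dat <- dat[order(dat$Chrm, dat$cluster, -dat$length, -dat$score), ]
best <- dat[!duplicated(dat[c("Chrm", "cluster")]), ]
rownames(best) <- NULL
print(best)
```

On the sample data this keeps C and D for C1 and A and B for C2, the same rows as the expected output. Using the running maximum matters when a short range is nested inside a longer one: a later range can still overlap the long one even if it starts after the short one ends.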

CodePudding user response:

library(dplyr)
library(IRanges)

# Define overlapping ranges as a subgroup
# (note: the ranges are merged without regard to Chrm; Chrm is only applied in group_by below)
ir <- IRanges(df$start, df$end)
df$subgroup <- subjectHits(findOverlaps(ir, reduce(ir)))

# Get max from each group and subgroup
df %>%
  group_by(Chrm, subgroup) %>%
  filter(length == max(length)) %>%
  filter(score == max(score))

# Install IRanges package
# https://bioconductor.org/packages/release/bioc/html/IRanges.html

result:

  Seq Chrm start end length score subgroup
1   C   C1     6  60     54    12        1
2   D   C1    70 120     50    12        2
3   A   C2   349 400     51    12        3
4   B   C2   450 500     50    12        4

Created on 2022-03-29 by the reprex package (v2.0.1)
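
Because `IRanges` has no notion of chromosome, `reduce(ir)` above merges ranges from all `Chrm` values together before `group_by(Chrm, subgroup)` splits them again. If coordinates on different chromosomes can overlap, one possible variant (a sketch, assuming the Bioconductor package `GenomicRanges` is installed) stores `Chrm` as the sequence name so that merging happens per chromosome. The selection step is written in base R here only to keep the example short; `gr`, `merged` and `res` are illustrative names.

```r
library(GenomicRanges)  # Bioconductor package; also attaches IRanges

df <- data.frame(
  Seq    = c("A", "B", "C", "Cbis", "D", "E", "F", "A", "B"),
  Chrm   = c("C1", "C1", "C1", "C1", "C1", "C1", "C2", "C2", "C2"),
  start  = c(1L, 3L, 6L, 6L, 70L, 78L, 350L, 349L, 450L),
  end    = c(50L, 55L, 60L, 60L, 120L, 111L, 400L, 400L, 500L),
  length = c(49L, 52L, 54L, 54L, 50L, 33L, 50L, 51L, 50L),
  score  = c(12L, 12L, 12L, 11L, 12L, 12L, 12L, 12L, 12L)
)

# One genomic range per row, with Chrm as the sequence name
gr <- GRanges(seqnames = df$Chrm,
              ranges = IRanges(start = df$start, end = df$end))

# Merge overlapping ranges per chromosome, then map each row to its merged range
merged <- reduce(gr)
df$subgroup <- subjectHits(findOverlaps(gr, merged))

# Keep the longest row per subgroup, breaking ties by highest score
ord <- order(df$subgroup, -df$length, -df$score)
res <- df[ord, ]
res <- res[!duplicated(res$subgroup), ]
rownames(res) <- NULL
print(res)
```

As in the answer above, `reduce()` with its default settings also merges ranges that are directly adjacent, so treat the grouping as a starting point if adjacent ranges should stay separate.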