I have a dataframe such as
   Seq   Chrm  start  end  length  score
0  A     C1        1   50      49     12
1  B     C1        3   55      52     12
2  C     C1        6   60      54     12
3  Cbis  C1        6   60      54     11
4  D     C1       70  120      50     12
5  E     C1       78  111      33     12
6  F     C2      350  400      50     12
7  A     C2      349  400      51     12
8  B     C2      450  500      50     12
Within each Chrm, I would like to find the groups of rows whose start-end ranges overlap and, from each group, keep the row with the longest length and, among ties, the highest score.
For example, in C1:
Seq   Chrm  start  end  length  score
A     C1        1   50      49     12
B     C1        3   55      52     12
C     C1        6   60      54     12
Cbis  C1        6   60      54     11
D     C1       70  120      50     12
E     C1       78  111      33     12
The start-to-end coordinates of A, B, C and Cbis overlap one another, and those of D and E overlap each other.
In the A, B, C, Cbis group, the longest are C and Cbis with 54, so I then keep the one with the highest score, which is C (12). In the D, E group, the longest is D with 50.
So I keep only rows C and D here.
If I do the same for the other Chrm, I should get the following output:
Seq   Chrm  start  end  length  score
C     C1        6   60      54     12
D     C1       70  120      50     12
A     C2      349  400      51     12
B     C2      450  500      50     12
Here is the dataframe in dput format, in case it helps:
structure(list(Seq = c("A", "B", "C", "Cbis", "D", "E", "F",
"A", "B"), Chrm = c("C1", "C1", "C1", "C1", "C1", "C1", "C2",
"C2", "C2"), start = c(1L, 3L, 6L, 6L, 70L, 78L, 350L, 349L,
450L), end = c(50L, 55L, 60L, 60L, 120L, 111L, 400L, 400L, 500L
), length = c(49L, 52L, 54L, 54L, 50L, 33L, 50L, 51L, 50L), score = c(12L,
12L, 12L, 11L, 12L, 12L, 12L, 12L, 12L)), class = "data.frame", row.names = c(NA,
-9L))
CodePudding user response:
Using tidyverse functions:
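library(tidyverse)

dat %>%
  group_by(Chrm) %>%
  arrange(start, end) %>%
  # cum increases by one each time a row ends before the next row starts
  group_by(cum = head(c(0, cumsum((end < lead(start)) | (end > lead(start) & start > lead(end)))), -1)) %>%
  # desc() takes a single vector, so length and score each get their own call
  arrange(desc(length), desc(score)) %>%
  slice_head(n = 1)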
   Seq    Chrm   start    end  length  score    cum
   <chr>  <chr>  <int>  <int>   <int>  <int>  <dbl>
1  C      C1         6     60      54     12      0
2  D      C1        70    120      50     12      1
3  A      C2       349    400      51     12      2
4  B      C2       450    500      50     12      3
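One caveat: the grouping above compares each row only with the row that directly follows it and, as far as I can tell from the dplyr documentation, the expression inside group_by() is evaluated on the ungrouped table, so it is not computed per Chrm. That is fine for the example data, but it could split a cluster when a short range sits inside a long one, or join ranges from different Chrm. A minimal sketch of a stricter variant, assuming the data is in dat as in the question, tracks the running maximum of end within each Chrm instead. The column name grp is made up for illustration, and ranges that merely touch (start equal to an earlier end) are treated as overlapping here.

```r
library(dplyr)

# Example data from the question
dat <- data.frame(
  Seq    = c("A", "B", "C", "Cbis", "D", "E", "F", "A", "B"),
  Chrm   = c("C1", "C1", "C1", "C1", "C1", "C1", "C2", "C2", "C2"),
  start  = c(1L, 3L, 6L, 6L, 70L, 78L, 350L, 349L, 450L),
  end    = c(50L, 55L, 60L, 60L, 120L, 111L, 400L, 400L, 500L),
  length = c(49L, 52L, 54L, 54L, 50L, 33L, 50L, 51L, 50L),
  score  = c(12L, 12L, 12L, 11L, 12L, 12L, 12L, 12L, 12L)
)

res <- dat %>%
  arrange(Chrm, start, end) %>%
  group_by(Chrm) %>%
  # grp (hypothetical name) increases when a row starts after the largest
  # end seen so far among the earlier rows of the same Chrm
  mutate(grp = cumsum(start > lag(cummax(end), default = first(end)))) %>%
  group_by(Chrm, grp) %>%
  # longest first, ties broken by highest score, within each cluster
  arrange(desc(length), desc(score), .by_group = TRUE) %>%
  slice_head(n = 1) %>%
  ungroup()

res
```

On the question's data this keeps C and D for C1 and A and B for C2, the same rows as above.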
CodePudding user response:
# IRanges is a Bioconductor package:
# https://bioconductor.org/packages/release/bioc/html/IRanges.html
library(dplyr)
library(IRanges)

# Define overlapping ranges as a subgroup
ir <- IRanges(df$start, df$end)
df$subgroup <- subjectHits(findOverlaps(ir, reduce(ir)))

# Get max from each group and subgroup
df %>%
  group_by(Chrm, subgroup) %>%
  filter(length == max(length)) %>%
  filter(score == max(score))
result:
   Seq   Chrm  start  end  length  score  subgroup
1  C     C1        6   60      54     12         1
2  D     C1       70  120      50     12         2
3  A     C2      349  400      51     12         3
4  B     C2      450  500      50     12         4
Created on 2022-03-29 by the reprex package (v2.0.1)
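If installing a Bioconductor package is not an option, the subgroup column can probably be approximated in base R. The following is only a sketch under assumptions: it is not what IRanges does internally, the helper column prev_max_end and the result name best are made up for illustration, and subgroups are numbered from 0 within each Chrm rather than across the whole table. Ranges that merely touch are treated as overlapping.

```r
# Example data from the question
df <- data.frame(
  Seq    = c("A", "B", "C", "Cbis", "D", "E", "F", "A", "B"),
  Chrm   = c("C1", "C1", "C1", "C1", "C1", "C1", "C2", "C2", "C2"),
  start  = c(1L, 3L, 6L, 6L, 70L, 78L, 350L, 349L, 450L),
  end    = c(50L, 55L, 60L, 60L, 120L, 111L, 400L, 400L, 500L),
  length = c(49L, 52L, 54L, 54L, 50L, 33L, 50L, 51L, 50L),
  score  = c(12L, 12L, 12L, 11L, 12L, 12L, 12L, 12L, 12L)
)

# Sort by chromosome, then by coordinates
df <- df[order(df$Chrm, df$start, df$end), ]

# Largest end among the earlier rows of the same Chrm
# (the first row of each Chrm is compared with its own end)
df$prev_max_end <- ave(df$end, df$Chrm,
                       FUN = function(e) c(e[1], head(cummax(e), -1)))

# A new subgroup starts when a row begins after everything before it has ended
df$subgroup <- ave(as.integer(df$start > df$prev_max_end), df$Chrm, FUN = cumsum)

# Put the longest, then highest-scoring, row first in each subgroup and keep it
df <- df[order(df$Chrm, df$subgroup, -df$length, -df$score), ]
best <- df[!duplicated(df[c("Chrm", "subgroup")]),
           c("Seq", "Chrm", "start", "end", "length", "score", "subgroup")]

best
```

Unlike the filter() calls above, this keeps exactly one row per subgroup even when length and score are both tied.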