I have a large data frame that looks like this.
I want to group_by
seqnames and for each group, I want to check for overlapping ranges between the start and end.
If there is any overlapping range, then it should stay the row with the highest score.
library(dplyr)
#>
#> Attaching package: 'dplyr'
#> The following objects are masked from 'package:stats':
#>
#> filter, lag
#> The following objects are masked from 'package:base':
#>
#> intersect, setdiff, setequal, union
df <- tibble(seqnames=rep(c("Chr1","Chr2"),each=3),
start=c(100,200,300,100,200,300),
end=c(150,400,500,120,220,320),
score=c(1000,500,1000,1000,1000,1000))
df
#> # A tibble: 6 × 4
#> seqnames start end score
#> <chr> <dbl> <dbl> <dbl>
#> 1 Chr1 100 150 1000
#> 2 Chr1 200 400 500
#> 3 Chr1 300 500 1000
#> 4 Chr2 100 120 1000
#> 5 Chr2 200 220 1000
#> 6 Chr2 300 320 1000
Created on 2022-12-27 with reprex v2.0.2
the desired output is
seqnames start end score
<chr> <dbl> <dbl> <dbl>
Chr1 100 150 1000
Chr1 300 500 1000
Chr2 100 120 1000
Chr2 200 220 1000
Chr2 300 320 1000
CodePudding user response:
You could use ivs
, see:
library(ivs)
df <- df %>% mutate(interval = iv(start, end))
df %>%
group_by(seqnames) %>%
mutate(interval_group = iv_identify_group(interval)) %>%
group_by(seqnames,interval_group) %>%
top_n(1,score) %>%
ungroup %>%
select(seqnames, start,end,score)
# A tibble: 5 × 4
seqnames start end score
<chr> <dbl> <dbl> <dbl>
1 Chr1 100 150 1000
2 Chr1 300 500 1000
3 Chr2 100 120 1000
4 Chr2 200 220 1000
5 Chr2 300 320 1000
CodePudding user response:
This works for your example. The last arrange call is only there to get the output in the same format as your desired output.
library(dplyr)
#>
#> Attaching package: 'dplyr'
#> The following objects are masked from 'package:stats':
#>
#> filter, lag
#> The following objects are masked from 'package:base':
#>
#> intersect, setdiff, setequal, union
df <- tibble(seqnames=rep(c("Chr1","Chr2"),each=3),
start=c(100,200,300,100,200,300),
end=c(150,400,500,120,220,320),
score=c(1000,500,1000,1000,1000,1000))
df |>
group_by(seqnames) |>
arrange(start) |>
mutate(
remove = case_when(
end > lead(start) & score < lead(score) ~ TRUE,
start < lag(end) & score < lag(score) ~ TRUE,
TRUE ~ FALSE)
) |>
ungroup() |>
filter(!remove) |>
select(-remove) |>
arrange(seqnames, start)
#> # A tibble: 5 × 4
#> seqnames start end score
#> <chr> <dbl> <dbl> <dbl>
#> 1 Chr1 100 150 1000
#> 2 Chr1 300 500 1000
#> 3 Chr2 100 120 1000
#> 4 Chr2 200 220 1000
#> 5 Chr2 300 320 1000
Created on 2022-12-27 with reprex v2.0.2