Remove hits of overlapping ranges based on another column in R-CodePudding

I have a large data frame that looks like this.

I want to group_by seqnames and for each group, I want to check for overlapping ranges between the start and end. If there is any overlapping range, then it should stay the row with the highest score.

library(dplyr)
#> 
#> Attaching package: 'dplyr'
#> The following objects are masked from 'package:stats':
#> 
#>     filter, lag
#> The following objects are masked from 'package:base':
#> 
#>     intersect, setdiff, setequal, union
df <- tibble(seqnames=rep(c("Chr1","Chr2"),each=3),
       start=c(100,200,300,100,200,300),
       end=c(150,400,500,120,220,320),
       score=c(1000,500,1000,1000,1000,1000))

df
#> # A tibble: 6 × 4
#>   seqnames start   end score
#>   <chr>    <dbl> <dbl> <dbl>
#> 1 Chr1       100   150  1000
#> 2 Chr1       200   400   500
#> 3 Chr1       300   500  1000
#> 4 Chr2       100   120  1000
#> 5 Chr2       200   220  1000
#> 6 Chr2       300   320  1000

^{Created on 2022-12-27 with reprex v2.0.2}

the desired output is

  seqnames start   end score
  <chr>    <dbl> <dbl> <dbl>
 Chr1       100   150  1000
 Chr1       300   500  1000
 Chr2       100   120  1000
 Chr2       200   220  1000
 Chr2       300   320  1000

CodePudding user response：

You could use ivs, see:

library(ivs)

df <- df %>% mutate(interval = iv(start, end))

df %>%
  group_by(seqnames) %>%
  mutate(interval_group = iv_identify_group(interval)) %>%
  group_by(seqnames,interval_group) %>%
  top_n(1,score) %>% 
  ungroup %>%
  select(seqnames, start,end,score)


# A tibble: 5 × 4
  seqnames start   end score
  <chr>    <dbl> <dbl> <dbl>
1 Chr1       100   150  1000
2 Chr1       300   500  1000
3 Chr2       100   120  1000
4 Chr2       200   220  1000
5 Chr2       300   320  1000

CodePudding user response：

This works for your example. The last arrange call is only there to get the output in the same format as your desired output.

library(dplyr)
#> 
#> Attaching package: 'dplyr'
#> The following objects are masked from 'package:stats':
#> 
#>     filter, lag
#> The following objects are masked from 'package:base':
#> 
#>     intersect, setdiff, setequal, union

df <- tibble(seqnames=rep(c("Chr1","Chr2"),each=3),
             start=c(100,200,300,100,200,300),
             end=c(150,400,500,120,220,320),
             score=c(1000,500,1000,1000,1000,1000))


df |> 
  group_by(seqnames) |> 
  arrange(start) |> 
  mutate(
    remove = case_when(
      end > lead(start)  & score < lead(score) ~ TRUE,
      start < lag(end)  & score < lag(score) ~ TRUE,
      TRUE ~ FALSE)
    ) |>
  ungroup() |> 
    filter(!remove) |> 
  select(-remove) |>
  arrange(seqnames, start)
#> # A tibble: 5 × 4
#>   seqnames start   end score
#>   <chr>    <dbl> <dbl> <dbl>
#> 1 Chr1       100   150  1000
#> 2 Chr1       300   500  1000
#> 3 Chr2       100   120  1000
#> 4 Chr2       200   220  1000
#> 5 Chr2       300   320  1000

^{Created on 2022-12-27 with reprex v2.0.2}