Home > Mobile >  Remove hits of overlapping ranges based on another column in R
Remove hits of overlapping ranges based on another column in R

Time:12-28

I have a large data frame that looks like this.

I want to group_by seqnames and for each group, I want to check for overlapping ranges between the start and end. If there is any overlapping range, then it should stay the row with the highest score.

library(dplyr)
#> 
#> Attaching package: 'dplyr'
#> The following objects are masked from 'package:stats':
#> 
#>     filter, lag
#> The following objects are masked from 'package:base':
#> 
#>     intersect, setdiff, setequal, union
df <- tibble(seqnames=rep(c("Chr1","Chr2"),each=3),
       start=c(100,200,300,100,200,300),
       end=c(150,400,500,120,220,320),
       score=c(1000,500,1000,1000,1000,1000))

df
#> # A tibble: 6 × 4
#>   seqnames start   end score
#>   <chr>    <dbl> <dbl> <dbl>
#> 1 Chr1       100   150  1000
#> 2 Chr1       200   400   500
#> 3 Chr1       300   500  1000
#> 4 Chr2       100   120  1000
#> 5 Chr2       200   220  1000
#> 6 Chr2       300   320  1000

Created on 2022-12-27 with reprex v2.0.2

the desired output is

  seqnames start   end score
  <chr>    <dbl> <dbl> <dbl>
 Chr1       100   150  1000
 Chr1       300   500  1000
 Chr2       100   120  1000
 Chr2       200   220  1000
 Chr2       300   320  1000

CodePudding user response:

You could use ivs, see:

library(ivs)

df <- df %>% mutate(interval = iv(start, end))

df %>%
  group_by(seqnames) %>%
  mutate(interval_group = iv_identify_group(interval)) %>%
  group_by(seqnames,interval_group) %>%
  top_n(1,score) %>% 
  ungroup %>%
  select(seqnames, start,end,score)


# A tibble: 5 × 4
  seqnames start   end score
  <chr>    <dbl> <dbl> <dbl>
1 Chr1       100   150  1000
2 Chr1       300   500  1000
3 Chr2       100   120  1000
4 Chr2       200   220  1000
5 Chr2       300   320  1000

CodePudding user response:

This works for your example. The last arrange call is only there to get the output in the same format as your desired output.

library(dplyr)
#> 
#> Attaching package: 'dplyr'
#> The following objects are masked from 'package:stats':
#> 
#>     filter, lag
#> The following objects are masked from 'package:base':
#> 
#>     intersect, setdiff, setequal, union

df <- tibble(seqnames=rep(c("Chr1","Chr2"),each=3),
             start=c(100,200,300,100,200,300),
             end=c(150,400,500,120,220,320),
             score=c(1000,500,1000,1000,1000,1000))


df |> 
  group_by(seqnames) |> 
  arrange(start) |> 
  mutate(
    remove = case_when(
      end > lead(start)  & score < lead(score) ~ TRUE,
      start < lag(end)  & score < lag(score) ~ TRUE,
      TRUE ~ FALSE)
    ) |>
  ungroup() |> 
    filter(!remove) |> 
  select(-remove) |>
  arrange(seqnames, start)
#> # A tibble: 5 × 4
#>   seqnames start   end score
#>   <chr>    <dbl> <dbl> <dbl>
#> 1 Chr1       100   150  1000
#> 2 Chr1       300   500  1000
#> 3 Chr2       100   120  1000
#> 4 Chr2       200   220  1000
#> 5 Chr2       300   320  1000

Created on 2022-12-27 with reprex v2.0.2

  • Related