Find which column ranges overlap after grouping in R

I have a huge data frame that looks like this.

I want to group_by(chr) and then, for each chr, find out:

Does any range1 (start1, end1) within the chr group overlap any range2 (start2, end2)?

library(dplyr)

df1 <- tibble(chr = c(1, 1, 2, 2),
              start1 = c(100, 200, 100, 200),
              end1 = c(150, 400, 150, 400),
              species = c("Penguin"),
              start2 = c(200, 200, 500, 1000),
              end2 = c(250, 240, 1000, 2000)
              )

df1
#> # A tibble: 4 × 6
#>     chr start1  end1 species start2  end2
#>   <dbl>  <dbl> <dbl> <chr>    <dbl> <dbl>
#> 1     1    100   150 Penguin    200   250
#> 2     1    200   400 Penguin    200   240
#> 3     2    100   150 Penguin    500  1000
#> 4     2    200   400 Penguin   1000  2000

Created on 2023-01-05 with reprex v2.0.2

I want my data to look like this:

# A tibble: 4 × 7
    chr start1  end1 species start2  end2 OVERLAP
      1    100   150 Penguin    200   250 TRUE
      1    200   400 Penguin    200   240 TRUE
      2    100   150 Penguin    500  1000 FALSE
      2    200   400 Penguin   1000  2000 FALSE

I have fought a lot with the ivs package and iv_overlaps with no success in getting what I want.

Major EDIT:

When I apply any of the code from the answers to my real data, I do not get the results I want, and I am confused. Why?

data <- tibble::tribble(
  ~chr, ~start1, ~end1, ~strand, ~gene, ~start2, ~end2,
  "Chr2",   2739,   2840, " ", "A",    740,   1739,
  "Chr2",  12577,  12678, " ", "B",  10578,  11577,
  "Chr2",  22431,  22532, " ", "C",  20432,  21431,
  "Chr2",  32202,  32303, " ", "D",  30203,  31202,
  "Chr2",  42024,  42125, " ", "E",  40025,  41024,
  "Chr2",  51830,  51931, " ", "F",  49831,  50830,
  "Chr2",  82061,  84742, " ", "G",  80062,  81061,
  "Chr2",  84811,  86692, " ", "H",  82812,  83811,
  "Chr2",  86782,  88106, "-", "I",  88107,  89106,
  "Chr2", 139454, 139555, " ", "J", 137455, 138454,
  )

data %>% 
  group_by(chr) %>% 
  mutate(overlap = any(iv_overlaps(iv(start1, end1), iv(start2, end2))))

This gives the following output:

   chr   start1   end1 strand gene  start2   end2 overlap
   <chr>  <dbl>  <dbl> <chr>  <chr>  <dbl>  <dbl> <lgl>  
 1 Chr2    2739   2840        A        740   1739 TRUE   
 2 Chr2   12577  12678        B      10578  11577 TRUE   
 3 Chr2   22431  22532        C      20432  21431 TRUE   
 4 Chr2   32202  32303        D      30203  31202 TRUE   
 5 Chr2   42024  42125        E      40025  41024 TRUE   
 6 Chr2   51830  51931        F      49831  50830 TRUE   
 7 Chr2   82061  84742        G      80062  81061 TRUE   
 8 Chr2   84811  86692        H      82812  83811 TRUE   
 9 Chr2   86782  88106 -      I      88107  89106 TRUE   
10 Chr2  139454 139555        J     137455 138454 TRUE

This is wrong. There might be indirect matches, but there is no direct overlap.

CodePudding user response：

You can use ivs::iv_overlaps() like so. It outputs TRUE even if the overlapping range2 sits in a different row of the group. (I modified your data frame to show this: in row 2, start2 and end2 are now 0 and 50.)

library(ivs)
library(dplyr)
df1 %>% 
  group_by(chr) %>% 
  mutate(overlap = any(iv_overlaps(iv(start1, end1), iv(start2, end2))))

output

# A tibble: 4 × 7
# Groups:   chr [2]
    chr start1  end1 species start2  end2 overlap
  <dbl>  <dbl> <dbl> <chr>    <dbl> <dbl> <lgl>  
1     1    100   150 Penguin    200   250 TRUE   
2     1    200   400 Penguin      0    50 TRUE   
3     2    100   150 Penguin    500  1000 FALSE  
4     2    200   400 Penguin   1000  2000 FALSE

CodePudding user response：

The condition to determine whether two ranges overlap is

start1 <= end2 & end1 >= start2

library(dplyr)

df1 %>%
  group_by(chr) %>%
  mutate(OVERLAP = any(start1 <= end2 & end1 >= start2)) %>%
  ungroup()

# # A tibble: 4 × 7
#     chr start1  end1 species start2  end2 OVERLAP
#   <dbl>  <dbl> <dbl> <chr>    <dbl> <dbl> <lgl>  
# 1     1    100   150 Penguin    200   250 TRUE   
# 2     1    200   400 Penguin    200   240 TRUE   
# 3     2    100   150 Penguin    500  1000 FALSE  
# 4     2    200   400 Penguin   1000  2000 FALSE
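To see what any() is collapsing here, it can help to look at the row-by-row result first. The following is a minimal sketch in base R (so it runs without dplyr); the names pairwise and by_group are only chosen for illustration.

```r
# Same numeric columns as df1 from the question, as a plain data frame
df1 <- data.frame(
  chr    = c(1, 1, 2, 2),
  start1 = c(100, 200, 100, 200),
  end1   = c(150, 400, 150, 400),
  start2 = c(200, 200, 500, 1000),
  end2   = c(250, 240, 1000, 2000)
)

# Row-by-row test: does range1 overlap range2 in the SAME row?
pairwise <- with(df1, start1 <= end2 & end1 >= start2)
pairwise
# FALSE  TRUE FALSE FALSE

# any() per chr: a single TRUE in a group marks every row of that group
by_group <- ave(pairwise, df1$chr, FUN = any)
by_group
# TRUE  TRUE FALSE FALSE
```

So row 1 is flagged TRUE in the grouped result only because row 2 of the same chr has an overlap.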

If the intervals are directed, i.e. end can be less than start, then you need to order the endpoints of each interval before determining overlaps.

df1 %>%
  group_by(chr) %>%
  mutate(OVERLAP = any(pmin(start1, end1) <= pmax(start2, end2) &
                       pmax(start1, end1) >= pmin(start2, end2)))
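As a quick illustration of why the pmin()/pmax() step matters, here is a small base R sketch with made-up numbers (not taken from the question). The helper names overlap_naive and overlap_directed are hypothetical.

```r
# Naive closed-interval test; assumes start <= end on both sides
overlap_naive <- function(s1, e1, s2, e2) s1 <= e2 & e1 >= s2

# Orientation-safe version: order the endpoints first with pmin()/pmax()
overlap_directed <- function(s1, e1, s2, e2) {
  pmin(s1, e1) <= pmax(s2, e2) & pmax(s1, e1) >= pmin(s2, e2)
}

# Range 1 is written "backwards" (400 down to 200); range 2 is 200 to 240
overlap_naive(400, 200, 200, 240)     # FALSE
overlap_directed(400, 200, 200, 240)  # TRUE
```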

Furthermore, if you want to check whether an interval (start1, end1) overlaps any of the intervals (start2, end2) in the group, which is how ivs::iv_overlaps() works, then you can implement it with purrr::map2_lgl().

df1 %>%
  group_by(chr) %>%
  mutate(OVERLAP = any(
    purrr::map2_lgl(start1, end1,
                    ~ any(min(.x, .y) <= pmax(start2, end2) &
                          max(.x, .y) >= pmin(start2, end2)))
  ))
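The all-against-all comparison can also be laid out as a matrix, which makes it easier to see which pair of rows produces a TRUE. Below is a minimal base R sketch using rows G, H and I from the real data in the question; it assumes closed intervals with start <= end, and the name hits is only illustrative.

```r
# Rows G, H and I from the real data in the question
genes  <- c("G", "H", "I")
start1 <- c(82061, 84811, 86782)
end1   <- c(84742, 86692, 88106)
start2 <- c(80062, 82812, 88107)
end2   <- c(81061, 83811, 89106)

# hits[i, j] is TRUE when range1 of row i overlaps range2 of row j
hits <- outer(seq_along(start1), seq_along(start2),
              function(i, j) start1[i] <= end2[j] & end1[i] >= start2[j])
dimnames(hits) <- list(range1 = genes, range2 = genes)

hits["G", "G"]  # FALSE: no overlap within G's own row
hits["G", "H"]  # TRUE: G's range1 overlaps H's range2
any(hits)       # TRUE, which is what the grouped any() reports for the chr
```

If only same-row overlaps should count, the diagonal of this matrix (diag(hits)) is the relevant part.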

CodePudding user response：

If you want to check whether the overlap occurs in either direction, you can compare the overall span of range1 with the overall span of range2 within each group:

df1 %>%
  group_by(chr) %>%
  mutate(overlap = (max(end1) > min(start2) & min(start2) > min(start1))|
                   (max(end2) > min(start1) & min(start1) > min(start2))) 
#> # A tibble: 4 x 7
#> # Groups:   chr [2]
#>     chr start1  end1 species start2  end2 overlap
#>   <dbl>  <dbl> <dbl> <chr>    <dbl> <dbl> <lgl>  
#> 1     1    100   150 Penguin    200   250 TRUE   
#> 2     1    200   400 Penguin    200   240 TRUE   
#> 3     2    100   150 Penguin    500  1000 FALSE  
#> 4     2    200   400 Penguin   1000  2000 FALSE

Created on 2023-01-05 with reprex v2.0.2
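Because the expression above works on the group-level min() and max(), it tests whether the two overall spans overlap, which is not always the same as some individual range1 overlapping some individual range2. A small base R sketch with made-up numbers (not from the question) shows the difference; span_overlap and interval_overlap are illustrative names.

```r
# Hypothetical group: two range1 intervals and one range2 interval
start1 <- c(100, 300); end1 <- c(150, 400)
start2 <- 200;         end2 <- 250

# Span-based test: compares group-level extremes only
span_overlap <- (max(end1) > min(start2) & min(start2) > min(start1)) |
                (max(end2) > min(start1) & min(start1) > min(start2))

# Interval-by-interval test: each range1 against the single range2
interval_overlap <- any(start1 <= end2 & end1 >= start2)

span_overlap      # TRUE
interval_overlap  # FALSE
```

Here range2 (200 to 250) falls in the gap between the two range1 intervals, so the spans overlap although no individual intervals do.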

CodePudding user response：

If your definition of overlap is not overlap as in Darren's answer (https://stackoverflow.com/a/75021631/11732165) but containment, i.e. (start1 >= start2 & end1 <= end2) | (start2 >= start1 & end2 <= end1), then take that answer and substitute the logic you want.

I use a cross join to make sure you compare all rows under the same chr.

Unfortunately, there IS undeniably a full containment in your test data:

 chr   start1   end1 strand gene  start2   end2 overlap
 7 Chr2   82061  84742        G      80062  81061 TRUE   
 8 Chr2   84811  86692        H      82812  83811 TRUE

[start2, end2] for H is contained in [start1, end1] for G.
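The containment can be checked directly with the numbers from those two rows. This is just a base R arithmetic sketch; the variable names are made up for illustration.

```r
# Values copied from rows G and H of the real data
g_start1 <- 82061; g_end1 <- 84742   # range1 of gene G
h_start2 <- 82812; h_end2 <- 83811   # range2 of gene H

# Containment: H's range2 lies entirely inside G's range1
contained <- h_start2 >= g_start1 & h_end2 <= g_end1
contained  # TRUE
```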

Code follows. Note that performance will degrade rapidly if there are a lot of records under a single chr: over 200 is likely to be intolerable, and you'll want an implementation that doesn't involve a self-cross join.

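library(dplyr)

# TRUE for a chr if some range1 overlaps a range2 from a DIFFERENT row of that chr
check_overlap <- function(df) {
  df %>%
    mutate(temp_id = seq_len(nrow(df))) %>%
    # Self-join on chr: pairs every row with every row of the same chr
    inner_join(., ., by = "chr") %>%
    # Drop the pairs that match a row with itself
    filter(temp_id.x != temp_id.y) %>%
    mutate(overlaps = start1.x <= end2.y & end1.x >= start2.y) %>%
    group_by(chr) %>%
    summarise(OVERLAP = any(overlaps)) %>%
    # Attach the per-chr flag to the original rows
    inner_join(df, by = "chr")
}

# Same structure, but the test is containment in either direction
check_containment <- function(df) {
  df %>%
    mutate(temp_id = seq_len(nrow(df))) %>%
    inner_join(., ., by = "chr") %>%
    filter(temp_id.x != temp_id.y) %>%
    mutate(overlaps = (start1.x >= start2.y & end1.x <= end2.y) |
                      (start2.y >= start1.x & end2.y <= end1.x)) %>%
    group_by(chr) %>%
    summarise(OVERLAP = any(overlaps)) %>%
    inner_join(df, by = "chr")
}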
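For large groups, one possible way to avoid the self-join is to sort range2 by start and keep a running maximum of end2. The following is only a sketch in base R under stated assumptions: closed intervals, start <= end, no missing values, and a single chr at a time (split the data by chr first). The helper overlaps_any is hypothetical, not part of any package, and unlike check_overlap it does not exclude a row's own range2.

```r
# Hypothetical helper: for every range1, report whether at least one range2
# satisfies start2 <= end1 and end2 >= start1.
overlaps_any <- function(start1, end1, start2, end2) {
  ord <- order(start2)
  start2_sorted <- start2[ord]
  # Largest end2 among all range2 that start at or before each sorted start2
  max_end2 <- cummax(end2[ord])
  # k[i] = number of range2 that start at or before end1[i]
  k <- findInterval(end1, start2_sorted)
  hit <- rep(FALSE, length(start1))
  candidates <- k > 0
  # Among those candidates, an overlap exists if the furthest-reaching one
  # ends at or after start1
  hit[candidates] <- max_end2[k[candidates]] >= start1[candidates]
  hit
}

# Rows G, H and I from the question's real data
res <- overlaps_any(
  start1 = c(82061, 84811, 86782), end1 = c(84742, 86692, 88106),
  start2 = c(80062, 82812, 88107), end2 = c(81061, 83811, 89106)
)
res
# TRUE FALSE FALSE
```

The work is dominated by the sort, so it grows roughly as n log n per chr rather than with the square of the group size.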