I have a huge data frame that looks like the one below.
I want to group_by(chr) and then, for each chr, find out:
- Does any range1 (start1, end1) within the chr group overlap with any range2 (start2, end2)?
library(dplyr)
df1 <- tibble(chr = c(1, 1, 2, 2),
              start1 = c(100, 200, 100, 200),
              end1 = c(150, 400, 150, 400),
              species = c("Penguin"),
              start2 = c(200, 200, 500, 1000),
              end2 = c(250, 240, 1000, 2000)
)
df1
#> # A tibble: 4 × 6
#> chr start1 end1 species start2 end2
#> <dbl> <dbl> <dbl> <chr> <dbl> <dbl>
#> 1 1 100 150 Penguin 200 250
#> 2 1 200 400 Penguin 200 240
#> 3 2 100 150 Penguin 500 1000
#> 4 2 200 400 Penguin 1000 2000
Created on 2023-01-05 with reprex v2.0.2
I want my data to look like this:
# A tibble: 4 × 7
chr start1 end1 species start2 end2 OVERLAP
1 100 150 Penguin 200 250 TRUE
1 200 400 Penguin 200 240 TRUE
2 100 150 Penguin 500 1000 FALSE
2 200 400 Penguin 1000 2000 FALSE
I have struggled a lot with the ivs package and iv_overlaps, with no success in getting what I want.
Major EDIT:
When I apply any of the suggested code to my real data, I do not get the results I want, and I am confused. Why?
data <- tibble::tribble(
~chr, ~start1, ~end1, ~strand, ~gene, ~start2, ~end2,
"Chr2", 2739, 2840, " ", "A", 740, 1739,
"Chr2", 12577, 12678, " ", "B", 10578, 11577,
"Chr2", 22431, 22532, " ", "C", 20432, 21431,
"Chr2", 32202, 32303, " ", "D", 30203, 31202,
"Chr2", 42024, 42125, " ", "E", 40025, 41024,
"Chr2", 51830, 51931, " ", "F", 49831, 50830,
"Chr2", 82061, 84742, " ", "G", 80062, 81061,
"Chr2", 84811, 86692, " ", "H", 82812, 83811,
"Chr2", 86782, 88106, "-", "I", 88107, 89106,
"Chr2", 139454, 139555, " ", "J", 137455, 138454,
)
library(dplyr)
library(ivs)
data %>%
  group_by(chr) %>%
  mutate(overlap = any(iv_overlaps(iv(start1, end1), iv(start2, end2))))
This gives the following output:
chr start1 end1 strand gene start2 end2 overlap
<chr> <dbl> <dbl> <chr> <chr> <dbl> <dbl> <lgl>
1 Chr2 2739 2840 A 740 1739 TRUE
2 Chr2 12577 12678 B 10578 11577 TRUE
3 Chr2 22431 22532 C 20432 21431 TRUE
4 Chr2 32202 32303 D 30203 31202 TRUE
5 Chr2 42024 42125 E 40025 41024 TRUE
6 Chr2 51830 51931 F 49831 50830 TRUE
7 Chr2 82061 84742 G 80062 81061 TRUE
8 Chr2 84811 86692 H 82812 83811 TRUE
9 Chr2 86782 88106 - I 88107 89106 TRUE
10 Chr2 139454 139555 J 137455 138454 TRUE
This is wrong. There might be indirect matches, but there is no direct overlap.
CodePudding user response:
You can use iv_overlaps like so, which will output TRUE even if the overlapping range is on a different row. (I modified your data frame to reflect it.)
library(ivs)
library(dplyr)
df1 %>%
group_by(chr) %>%
mutate(overlap = any(iv_overlaps(iv(start1, end1), iv(start2, end2))))
output
# A tibble: 4 × 7
# Groups: chr [2]
chr start1 end1 species start2 end2 overlap
<dbl> <dbl> <dbl> <chr> <dbl> <dbl> <lgl>
1 1 100 150 Penguin 200 250 TRUE
2 1 200 400 Penguin 0 50 TRUE
3 2 100 150 Penguin 500 1000 FALSE
4 2 200 400 Penguin 1000 2000 FALSE
CodePudding user response:
The condition to determine whether two ranges overlap is
start1 <= end2 & end1 >= start2
library(dplyr)
df1 %>%
group_by(chr) %>%
mutate(OVERLAP = any(start1 <= end2 & end1 >= start2)) %>%
ungroup()
# # A tibble: 4 × 7
# chr start1 end1 species start2 end2 OVERLAP
# <dbl> <dbl> <dbl> <chr> <dbl> <dbl> <lgl>
# 1 1 100 150 Penguin 200 250 TRUE
# 2 1 200 400 Penguin 200 240 TRUE
# 3 2 100 150 Penguin 500 1000 FALSE
# 4 2 200 400 Penguin 1000 2000 FALSE
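As a quick sanity check of the condition above, here is a minimal base R sketch. The helper name ranges_overlap is hypothetical (it is not from any package), and it assumes closed intervals with start <= end, so ranges that merely touch at an endpoint count as overlapping. Note that ivs treats intervals as half-open, [start, end), so touching ranges would not count as overlapping there.

```r
# Hypothetical helper: closed-interval overlap test, vectorised over rows.
# Assumes start <= end for every range.
ranges_overlap <- function(start1, end1, start2, end2) {
  start1 <= end2 & end1 >= start2
}

r_disjoint <- ranges_overlap(100, 150, 200, 250)  # ranges are apart
r_inside   <- ranges_overlap(200, 400, 200, 240)  # second range lies inside the first
r_touching <- ranges_overlap(100, 200, 200, 250)  # ranges share only the point 200

print(c(disjoint = r_disjoint, inside = r_inside, touching = r_touching))
```

If touching ranges should not count, replacing <= and >= with < and > would be the natural change.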
If the intervals are directed, i.e. end can be less than start, then you need to sort the endpoints of each interval before determining overlaps.
df1 %>%
  group_by(chr) %>%
  mutate(OVERLAP = any(pmin(start1, end1) <= pmax(start2, end2) &
                       pmax(start1, end1) >= pmin(start2, end2)))
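To see why the endpoint sorting matters, here is a small base R sketch. The helper name overlap_undirected is hypothetical, and the reversed range is made up for illustration rather than taken from the question's data.

```r
# Hypothetical helper: overlap test that tolerates end < start
# by ordering each interval's endpoints first (closed intervals).
overlap_undirected <- function(start1, end1, start2, end2) {
  lo1 <- pmin(start1, end1)
  hi1 <- pmax(start1, end1)
  lo2 <- pmin(start2, end2)
  hi2 <- pmax(start2, end2)
  lo1 <= hi2 & hi1 >= lo2
}

# A reversed range (400 down to 200) against a forward range (200 to 240)
with_sorting <- overlap_undirected(400, 200, 200, 240)

# The plain condition applied to the same numbers misses the overlap,
# because start1 (400) is greater than end2 (240)
without_sorting <- 400 <= 240 & 200 >= 200

print(c(with_sorting = with_sorting, without_sorting = without_sorting))
```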
Furthermore, if you want to check whether an interval (start1, end1) overlaps any of the intervals (start2, end2), which is how ivs::iv_overlaps() works, you can implement it with purrr::map2_lgl().
df1 %>%
  group_by(chr) %>%
  mutate(OVERLAP = any(
    purrr::map2_lgl(start1, end1,
                    ~ any(min(.x, .y) <= pmax(start2, end2) &
                          max(.x, .y) >= pmin(start2, end2)))
  ))
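The same "each range1 against every range2" comparison can also be sketched in base R with outer(), which builds the full pairwise matrix in one step. The helper name any_vs_any is hypothetical, and the sketch assumes closed intervals with start <= end; memory use grows with the product of the two lengths, so it only suits moderately sized groups.

```r
# Hypothetical helper: for each range1, does it overlap ANY range2?
# Assumes closed intervals with start <= end.
any_vs_any <- function(start1, end1, start2, end2) {
  # pair[i, j] is TRUE when range1 i overlaps range2 j
  pair <- outer(start1, end2, "<=") & outer(end1, start2, ">=")
  rowSums(pair) > 0
}

# The chr 1 rows of df1: range1 = (100,150), (200,400); range2 = (200,250), (200,240)
per_range1 <- any_vs_any(c(100, 200), c(150, 400), c(200, 200), c(250, 240))
print(per_range1)

# One flag for the whole group, as in the any(...) calls above
group_flag <- any(per_range1)
print(group_flag)
```

Keeping the per-range1 vector before calling any() makes it possible to see which ranges are responsible for a group-level TRUE.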
CodePudding user response:
If you want to check whether the overlap occurs in either direction, you need:
df1 %>%
group_by(chr) %>%
mutate(overlap = (max(end1) > min(start2) & min(start2) > min(start1))|
(max(end2) > min(start1) & min(start1) > min(start2)))
#> # A tibble: 4 x 7
#> # Groups: chr [2]
#> chr start1 end1 species start2 end2 overlap
#> <dbl> <dbl> <dbl> <chr> <dbl> <dbl> <lgl>
#> 1 1 100 150 Penguin 200 250 TRUE
#> 2 1 200 400 Penguin 200 240 TRUE
#> 3 2 100 150 Penguin 500 1000 FALSE
#> 4 2 200 400 Penguin 1000 2000 FALSE
Created on 2023-01-05 with reprex v2.0.2
CodePudding user response:
If your definition of overlap is not overlap as in Darren's answer (https://stackoverflow.com/a/75021631/11732165) but containment, i.e. (start1 >= start2 & end1 <= end2) | (start2 >= start1 & end2 <= end1), then take the answer and swap in the logic you want.
I use a cross join to make sure every row is compared with every other row under the same chr.
Unfortunately, there IS undeniably a full containment in your test data:
chr start1 end1 strand gene start2 end2 overlap
7 Chr2 82061 84742 G 80062 81061 TRUE
8 Chr2 84811 86692 H 82812 83811 TRUE
[start2, end2] for H is contained in [start1, end1] for G.
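A minimal base R sketch of that point, using only the G and H rows from the question's data: comparing each range1 with the range2 on its own row finds nothing, while comparing G's range1 with H's range2 does find an overlap. Wrapping a cross-row check in any() per group then turns every row of the group TRUE. The object names here are made up for illustration.

```r
# The two rows in question, copied from the question's data
g_h <- data.frame(
  gene   = c("G", "H"),
  start1 = c(82061, 84811),
  end1   = c(84742, 86692),
  start2 = c(80062, 82812),
  end2   = c(81061, 83811)
)

# Same-row comparison: range1 against the range2 on its own row
same_row <- g_h$start1 <= g_h$end2 & g_h$end1 >= g_h$start2
print(same_row)

# Cross-row comparison: range1 of G against range2 of H
g_vs_h <- g_h$start1[1] <= g_h$end2[2] & g_h$end1[1] >= g_h$start2[2]
print(g_vs_h)
```

If only same-row ("direct") overlaps are wanted, the row-wise condition without any() and without grouping would be enough.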
Code (note that performance will degrade rapidly if there are a lot of records under a single chr; over 200 is likely to be intolerable, and you'll want an implementation that doesn't involve a self-cross):
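# Flag each chr where range1 of one row overlaps range2 of a different row.
check_overlap <- function(df) {
  df %>%
    mutate(temp_id = seq_len(nrow(df))) %>%
    inner_join(., ., by = 'chr') %>%      # self-join: all row pairs within a chr
    filter(temp_id.x != temp_id.y) %>%    # drop each row paired with itself
    mutate(overlaps = start1.x <= end2.y & end1.x >= start2.y) %>%
    group_by(chr) %>%
    summarise(OVERLAP = any(overlaps)) %>%
    inner_join(df, by = 'chr')            # attach the flag to the original rows
}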
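# Same idea, but flag containment (one range fully inside the other) instead.
check_containment <- function(df) {
  df %>%
    mutate(temp_id = seq_len(nrow(df))) %>%
    inner_join(., ., by = 'chr') %>%      # self-join: all row pairs within a chr
    filter(temp_id.x != temp_id.y) %>%    # drop each row paired with itself
    mutate(overlaps = (start1.x >= start2.y & end1.x <= end2.y) |
                      (start2.y >= start1.x & end2.y <= end1.x)) %>%
    group_by(chr) %>%
    summarise(OVERLAP = any(overlaps)) %>%
    inner_join(df, by = 'chr')            # attach the flag to the original rows
}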
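One possible way to avoid the self-cross for the overlap case is sketched below in base R. This is an alternative technique, not part of the answer above: sort the range2 intervals by start, keep a running maximum of their ends, and look up each range1 with findInterval(). The helper name overlaps_any_sorted is hypothetical; it assumes closed intervals with start <= end and no missing values, and unlike check_overlap it does not exclude the range2 on the same row.

```r
# Hypothetical helper: for each range1, is there any range2 that overlaps it?
# Assumes closed intervals with start <= end and no missing values.
overlaps_any_sorted <- function(start1, end1, start2, end2) {
  ord <- order(start2)
  sorted_start2 <- start2[ord]
  running_max_end2 <- cummax(end2[ord])
  # k[i] = how many range2 intervals start at or before end1[i]
  k <- findInterval(end1, sorted_start2)
  hit <- k > 0
  # Among those k[i] intervals, one overlaps iff the largest end reaches start1[i]
  hit[hit] <- running_max_end2[k[hit]] >= start1[hit]
  hit
}

# The Chr2 rows from the question
start1 <- c(2739, 12577, 22431, 32202, 42024, 51830, 82061, 84811, 86782, 139454)
end1   <- c(2840, 12678, 22532, 32303, 42125, 51931, 84742, 86692, 88106, 139555)
start2 <- c(740, 10578, 20432, 30203, 40025, 49831, 80062, 82812, 88107, 137455)
end2   <- c(1739, 11577, 21431, 31202, 41024, 50830, 81061, 83811, 89106, 138454)

hit <- overlaps_any_sorted(start1, end1, start2, end2)
print(which(hit))  # row index of each range1 that overlaps some range2
```

Inside a grouped mutate(), wrapping the call in any() would give the per-chr flag, while using the result directly would give a per-row flag.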