I have a data frame with chromosomal coordinates, peak intersects, and other relevant information, for a total of 127,471 rows and 11 columns (abbreviated table below).
UnionChr | UnionStart | UnionEnd | IntersectChr | IntersectStart | IntersectEnd | IntersectName | Overlap | Genotype |
---|---|---|---|---|---|---|---|---|
chr1 | 3657227 | 3658092 | . | -1 | -1 | . | 0 | WT |
chr1 | 3657227 | 3658092 | chr1 | 3657227 | 3658092 | dko_k27_peak_1 | 865 | DKO |
chr1 | 3658443 | 3664519 | chr1 | 3658443 | 3662838 | wt_k27_peak_1 | 4395 | WT |
chr1 | 3658443 | 3664519 | chr1 | 3663340 | 3664519 | wt_k27_peak_2 | 1179 | WT |
chr1 | 3658443 | 3664519 | chr1 | 3658833 | 3664156 | dko_k27_peak_2 | 5323 | DKO |
chr1 | 3665636 | 3666032 | chr1 | 3665705 | 3666032 | wt_k27_peak_3 | 327 | WT |
chr1 | 3665636 | 3666032 | chr1 | 3665636 | 3665919 | dko_k27_peak_3 | 283 | DKO |
chr1 | 4468858 | 4469245 | chr1 | 4468858 | 4469245 | wt_k27_peak_4 | 387 | WT |
chr1 | 4468858 | 4469245 | . | -1 | -1 | . | 0 | DKO |
chr1 | 4472410 | 4473380 | . | -1 | -1 | . | 0 | WT |
chr1 | 4472410 | 4473380 | chr1 | 4472410 | 4473380 | dko_k27_peak_4 | 970 | DKO |
I want to remove any rows that have a unique peak intersect. A unique intersect is a row that has the same UnionChr, UnionStart and UnionEnd position as the row below it, but one row has an Overlap value of 0. For example, rows 1 and 2, 8 and 9, and 10 and 11 are examples of rows that are unique peak intersects that I would want to remove. I want to retain all other rows. In this case, rows 3-7.
I am having trouble finding ways to eliminate the unique rows. I have tried subsetting, for loops, using duplicated() and unique(). I know this would be easier in python, but I don't know python and I only need to do this once.
Thank you! Below is code to make the data frame:
df <- data.frame(UnionChr=c("chr1", "chr1", "chr1", "chr1", "chr1", "chr1", "chr1", "chr1", "chr1", "chr1", "chr1"),
UnionStart = c(3657227, 3657227, 3658443, 3658443, 3658443, 3665636, 3665636, 4468858, 4468858, 4472410, 4472410),
UnionEnd = c(3658092, 3658092, 3664519, 3664519, 3664519, 3666032, 3666032, 4469245, 4469245, 4473380, 4473380),
IntersectChr = c("." , "chr1", "chr1", "chr1", "chr1", "chr1", "chr1", "chr1", ".", ".", "chr1"),
IntersectStart = c(-1, 3657227, 3658443, 3663340, 3658833, 3665705, 3665636, 4468858, -1 , -1, 4472410),
IntersectEnd = c( -1, 3658092, 3662838, 3664519, 3664156, 3666032, 3665919, 4469245, -1, -1, 4473380),
IntersectName = c(".", "dko_k27_peak_1", "wt_k27_peak_1", "wt_k27_peak_2", "dko_k27_peak_2", "wt_k27_peak_3", "dko_k27_peak_3", "wt_k27_peak_4", ".", ".", "dko_k27_peak_4"),
Overlap = c(0, 865, 4395, 1179, 5323, 327, 283, 387, 0, 0, 970),
Genotype = c("WT", "DKO", "WT", "WT", "DKO", "WT", "DKO", "WT", "DKO", "WT", "DKO"))
CodePudding user response:
If you want to compare with both the row below and the row above (which your example suggest), I think you are looking for something like this. Your description is a little vague, so please do confirm this is doing what you think it should be doing.
Instead of building one very long filtering condition, I've broken it down in a few steps. This is a bit more readable, and you can look at those new columns to check each part of the filtering condition to make sure it is doing what you expect.
library(dplyr)
filtered_df <- df %>%
group_by(UnionChr) %>%
mutate(
same_start = UnionStart == lead(UnionStart) | UnionStart == lag(UnionStart),
same_end = UnionEnd == lead(UnionEnd) | UnionEnd == lag(UnionEnd),
zero_overlap = (Overlap == 0 | lead(Overlap == 0) | lag(Overlap == 0)),
combined = !(same_start & same_end & zero_overlap)
) %>%
filter(combined) %>%
select(-(same_start:combined))
Result:
# A tibble: 5 × 9 # Groups: UnionChr [1] UnionChr UnionStart UnionEnd IntersectChr IntersectStart IntersectEnd IntersectName Overlap <chr> <dbl> <dbl> <chr> <dbl> <dbl> <chr> <dbl> 1 chr1 3658443 3664519 chr1 3658443 3662838 wt_k27_peak_1 4395 2 chr1 3658443 3664519 chr1 3663340 3664519 wt_k27_peak_2 1179 3 chr1 3658443 3664519 chr1 3658833 3664156 dko_k27_peak_2 5323 4 chr1 3665636 3666032 chr1 3665705 3666032 wt_k27_peak_3 327 5 chr1 3665636 3666032 chr1 3665636 3665919 dko_k27_peak_3 283 # … with 1 more variable: Genotype <chr>
EDIT:
Based on your comment, I think it's actually easier. Find the groups of rows with the same Union*
variables, then check if any of the Overlap
s within that group is 0. If so, toss the whole group.
df %>%
group_by(UnionChr, UnionStart, UnionEnd) %>%
filter(!any(Overlap == 0))