Home > Back-end >  How do I remove unique rows based on certain column and row information in a data frame in R
How do I remove unique rows based on certain column and row information in a data frame in R

Time:06-10

I have a data frame with chromosomal coordinates, peak intersects, and other relevant information, for a total of 127,471 rows and 11 columns (abbreviated table below).

UnionChr UnionStart UnionEnd IntersectChr IntersectStart IntersectEnd IntersectName Overlap Genotype
chr1 3657227 3658092 . -1 -1 . 0 WT
chr1 3657227 3658092 chr1 3657227 3658092 dko_k27_peak_1 865 DKO
chr1 3658443 3664519 chr1 3658443 3662838 wt_k27_peak_1 4395 WT
chr1 3658443 3664519 chr1 3663340 3664519 wt_k27_peak_2 1179 WT
chr1 3658443 3664519 chr1 3658833 3664156 dko_k27_peak_2 5323 DKO
chr1 3665636 3666032 chr1 3665705 3666032 wt_k27_peak_3 327 WT
chr1 3665636 3666032 chr1 3665636 3665919 dko_k27_peak_3 283 DKO
chr1 4468858 4469245 chr1 4468858 4469245 wt_k27_peak_4 387 WT
chr1 4468858 4469245 . -1 -1 . 0 DKO
chr1 4472410 4473380 . -1 -1 . 0 WT
chr1 4472410 4473380 chr1 4472410 4473380 dko_k27_peak_4 970 DKO

I want to remove any rows that have a unique peak intersect. A unique intersect is a row that has the same UnionChr, UnionStart and UnionEnd position as the row below it, but one row has an Overlap value of 0. For example, rows 1 and 2, 8 and 9, and 10 and 11 are examples of rows that are unique peak intersects that I would want to remove. I want to retain all other rows. In this case, rows 3-7.

I am having trouble finding ways to eliminate the unique rows. I have tried subsetting, for loops, using duplicated() and unique(). I know this would be easier in python, but I don't know python and I only need to do this once.

Thank you! Below is code to make the data frame:

df <- data.frame(UnionChr=c("chr1", "chr1", "chr1", "chr1", "chr1", "chr1", "chr1", "chr1", "chr1", "chr1", "chr1"), 
                  UnionStart = c(3657227, 3657227, 3658443, 3658443, 3658443, 3665636, 3665636, 4468858, 4468858, 4472410, 4472410), 
                  UnionEnd = c(3658092, 3658092, 3664519, 3664519, 3664519, 3666032, 3666032, 4469245, 4469245, 4473380, 4473380), 
                  IntersectChr = c("." , "chr1", "chr1", "chr1", "chr1", "chr1", "chr1", "chr1", ".",  ".", "chr1"), 
                  IntersectStart = c(-1, 3657227, 3658443, 3663340, 3658833, 3665705, 3665636, 4468858, -1 , -1, 4472410), 
                  IntersectEnd = c( -1, 3658092, 3662838, 3664519, 3664156, 3666032, 3665919, 4469245, -1, -1, 4473380), 
                  IntersectName = c(".", "dko_k27_peak_1", "wt_k27_peak_1", "wt_k27_peak_2",  "dko_k27_peak_2", "wt_k27_peak_3", "dko_k27_peak_3", "wt_k27_peak_4",  ".", ".", "dko_k27_peak_4"), 
                  Overlap = c(0, 865, 4395, 1179, 5323, 327, 283, 387, 0, 0, 970), 
                  Genotype = c("WT", "DKO", "WT", "WT", "DKO", "WT", "DKO", "WT", "DKO", "WT", "DKO"))

CodePudding user response:

If you want to compare with both the row below and the row above (which your example suggest), I think you are looking for something like this. Your description is a little vague, so please do confirm this is doing what you think it should be doing.

Instead of building one very long filtering condition, I've broken it down in a few steps. This is a bit more readable, and you can look at those new columns to check each part of the filtering condition to make sure it is doing what you expect.

library(dplyr)    

filtered_df <- df %>% 
  group_by(UnionChr) %>% 
  mutate(
    same_start = UnionStart == lead(UnionStart) | UnionStart == lag(UnionStart),
    same_end   = UnionEnd   == lead(UnionEnd)   | UnionEnd   == lag(UnionEnd),
    zero_overlap = (Overlap == 0 | lead(Overlap == 0) | lag(Overlap == 0)),
    combined = !(same_start & same_end & zero_overlap)
  ) %>% 
  filter(combined) %>% 
  select(-(same_start:combined))

Result:

# A tibble: 5 × 9
# Groups:   UnionChr [1]
  UnionChr UnionStart UnionEnd IntersectChr IntersectStart IntersectEnd IntersectName  Overlap
  <chr>         <dbl>    <dbl> <chr>                 <dbl>        <dbl> <chr>            <dbl>
1 chr1        3658443  3664519 chr1                3658443      3662838 wt_k27_peak_1     4395
2 chr1        3658443  3664519 chr1                3663340      3664519 wt_k27_peak_2     1179
3 chr1        3658443  3664519 chr1                3658833      3664156 dko_k27_peak_2    5323
4 chr1        3665636  3666032 chr1                3665705      3666032 wt_k27_peak_3      327
5 chr1        3665636  3666032 chr1                3665636      3665919 dko_k27_peak_3     283
# … with 1 more variable: Genotype <chr>

EDIT:

Based on your comment, I think it's actually easier. Find the groups of rows with the same Union* variables, then check if any of the Overlaps within that group is 0. If so, toss the whole group.

df %>% 
  group_by(UnionChr, UnionStart, UnionEnd) %>% 
  filter(!any(Overlap == 0))
  • Related