Remove duplicated rows where columns A and B are swapped in the duplicate entry

Time:08-09

I need to remove duplicated table entries, but the duplicates are not exact copies. There are two ID columns, ID1 and ID2, and in half of the rows the values of ID1 and ID2 are swapped. The duplicated rows are not adjacent in the table, and I am having trouble removing them.

I am completely stuck on this. Part of the table is shown below. The first and last rows should match, as they are identical except for the swapped IDs. There are 2304 lines in the file, not counting the header, and 1152 of them are duplicates.

ID1 ID2 Sample Gene
S01-A01 S25-A01 BA159_0 AIM2
S01-A02 S25-A02 BA15_0 MMP8
S01-A17 S25-A17 BA144_0 SERPING1
S25-A01 S01-A01 BA159_0 AIM2

I tried using for loops to check whether the sample and gene were identical, but my logic is off.

id_1 <- c("S01-A01", "S01-A03", "S01-A17", "S25-A01", "S29-A39", "S05-A39")
id_2 <- c("S25-A01", "S25-A02", "S25-A17", "S01-A01", "S05-A39", "S29-A39")
Sample <-c("BA159_0", "BA15_0", "BA144_0", "BA159_0", "BA183_0", "BA183_0")
Gene <- c("AIM2", "MMP8", "SERPING1", "AIM2", "S100A8", "S100A8")

# Column names match the table above and the loop below (ID1, ID2)
df <- data.frame(ID1 = id_1,
                 ID2 = id_2,
                 Sample = Sample,
                 Gene = Gene)
 
dropvec = c()
    idvec =c()
    for (i in 1:length(df$ID1))
    {
       for (j in 1:length(df$ID2))
       {
          if(df$Sample[i] == df$Sample[j] && df$Gene[i] == df$Gene[j] && i != j)
          {
             idvec
             if(df$ID1[i] %in% idvec)
             {
                print(paste("ID ",df$ID1[i], " is in idvec"))
             }
             else {
                print("ID is NOT in idvec")
                idvec=c(idvec, df$ID2[j], df$ID2[i])
                dropvec = c(dropvec, j)
                
             }
    
          }
       }   
    }

I would appreciate any help with this. Following advice, I have updated my code to include the data frame. Thanks.

CodePudding user response:

A base R solution:

df[!duplicated(cbind(t(apply(df[1:2], 1, sort)), df[-(1:2)])), ]

#       ID1     ID2  Sample     Gene
# 1 S01-A01 S25-A01 BA159_0     AIM2
# 2 S01-A02 S25-A02  BA15_0     MMP8
# 3 S01-A17 S25-A17 BA144_0 SERPING1
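
For anyone who finds the one-liner hard to read, here is a rough step-by-step sketch of the same idea. The data frame is a hypothetical copy of the four-row table in the question, and the intermediate names (sorted_ids, key, is_dup, result) are introduced only for illustration.

```r
# Hypothetical copy of the four-row table from the question
df <- data.frame(
  ID1    = c("S01-A01", "S01-A02", "S01-A17", "S25-A01"),
  ID2    = c("S25-A01", "S25-A02", "S25-A17", "S01-A01"),
  Sample = c("BA159_0", "BA15_0", "BA144_0", "BA159_0"),
  Gene   = c("AIM2", "MMP8", "SERPING1", "AIM2"),
  stringsAsFactors = FALSE
)

# Step 1: sort the two IDs within each row. apply() returns one column per
# input row, so t() turns the result back into one row per input row.
sorted_ids <- t(apply(df[1:2], 1, sort))

# Step 2: build a comparison key from the sorted IDs plus the other columns
key <- cbind(sorted_ids, df[-(1:2)])

# Step 3: duplicated() flags every row whose key has already appeared
is_dup <- duplicated(key)

# Step 4: keep the first occurrence of each key; the kept rows retain
# their original (unsorted) ID1/ID2 values
result <- df[!is_dup, ]
print(result)
```

Because the sorted IDs are used only to build the key, the surviving rows keep whichever ID order appeared first in the data.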

CodePudding user response:

How about this, based on the tidyverse...

library(tidyverse)

d <- read.table(textConnection("ID1     ID2     Sample  Gene
S01-A01     S25-A01     BA159_0     AIM2
S01-A02     S25-A02     BA15_0  MMP8
S01-A17     S25-A17     BA144_0     SERPING1
S25-A01     S01-A01     BA159_0     AIM2"), header=TRUE)

d %>% 
  # Ensure IDs are in a consistent order
  mutate(
    T1=ifelse(ID1 < ID2, ID1, ID2),
    T2=ifelse(ID1 < ID2, ID2, ID1)
  ) %>% 
  # Remove redundant ID columns
  select(-ID1, -ID2) %>% 
  # Ensure uniqueness
  unique() %>% 
  # Restore the original column names (the IDs themselves stay in sorted order)
  rename(ID1=T1, ID2=T2)

#    Sample     Gene     ID1     ID2
# 1 BA159_0     AIM2 S01-A01 S25-A01
# 2  BA15_0     MMP8 S01-A02 S25-A02
# 3 BA144_0 SERPING1 S01-A17 S25-A17
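
The ID-ordering step does not strictly need ifelse(). As a sketch, assuming the ID columns are character vectors rather than factors, pmin() and pmax() produce the same ordering in base R, and filtering with duplicated() leaves the original column order untouched. The data frame is a hypothetical copy of the table in the question, and lo, hi, is_dup and deduped are illustrative names.

```r
# Hypothetical copy of the question's sample table
df <- data.frame(
  ID1    = c("S01-A01", "S01-A02", "S01-A17", "S25-A01"),
  ID2    = c("S25-A01", "S25-A02", "S25-A17", "S01-A01"),
  Sample = c("BA159_0", "BA15_0", "BA144_0", "BA159_0"),
  Gene   = c("AIM2", "MMP8", "SERPING1", "AIM2"),
  stringsAsFactors = FALSE  # assumption: IDs are character, not factor
)

# Element-wise smaller and larger ID of each row
lo <- pmin(df$ID1, df$ID2)
hi <- pmax(df$ID1, df$ID2)

# Flag rows whose (lo, hi, Sample, Gene) combination has been seen before
is_dup <- duplicated(data.frame(lo, hi, Sample = df$Sample, Gene = df$Gene))

# Keep only the first occurrence of each combination
deduped <- df[!is_dup, ]
print(deduped)
```

This avoids the row-by-row work of apply(), which may matter on larger tables, though for 2304 rows either approach should be fast enough.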

@gaut's comment about providing a minimal reproducible example is apposite. (Here, we needed a version of your test data provided by dput().) But you caught me on a good day.
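
In case it helps next time, here is a small sketch of what sharing data with dput() looks like. The two-row data frame is made up for illustration; dput() prints R code that rebuilds the object, and that printed code is what you would paste into a question.

```r
# Hypothetical two-row example; use your own data frame here
df <- data.frame(
  ID1    = c("S01-A01", "S25-A01"),
  ID2    = c("S25-A01", "S01-A01"),
  Sample = c("BA159_0", "BA159_0"),
  Gene   = c("AIM2", "AIM2"),
  stringsAsFactors = FALSE
)

# Prints a structure(list(...)) call that recreates df
dput(df)

# Anyone can rebuild the object by evaluating the printed text
txt <- paste(capture.output(dput(df)), collapse = "\n")
df_copy <- eval(parse(text = txt))
```

For a large table, dput(head(df, 10)) keeps the pasted text short while still giving others something they can run.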
