I need to remove duplicated table entries, but the duplicates aren't exact copies of each other. There are two IDs, id1 and id2, and in half of the files the two are swapped between id1 and id2. The duplicates are not sequential in the list, and I am having trouble removing them.
I am having such a brain fart over this. I have included part of the table below. You can see that the first and last rows should match, as they are identical except for the swapped IDs. There are 2304 lines in the file, not counting the header, and 1152 of them are duplicates.
ID1 | ID2 | Sample | Gene |
---|---|---|---|
S01-A01 | S25-A01 | BA159_0 | AIM2 |
S01-A02 | S25-A02 | BA15_0 | MMP8 |
S01-A17 | S25-A17 | BA144_0 | SERPING1 |
S25-A01 | S01-A01 | BA159_0 | AIM2 |
I tried messing around with for loops to check if the sample and gene were identical but my logic is off.
id_1 <- c("S01-A01", "S01-A03", "S01-A17", "S25-A01", "S29-A39", "S05-A39")
id_2 <- c("S25-A01", "S25-A02", "S25-A17", "S01-A01", "S05-A39", "S29-A39")
Sample <- c("BA159_0", "BA15_0", "BA144_0", "BA159_0", "BA183_0", "BA183_0")
Gene <- c("AIM2", "MMP8", "SERPING1", "AIM2", "S100A8", "S100A8")
# Columns are named ID1/ID2 so they match the table above and the df$ID1/df$ID2 references in the loop
df <- data.frame(ID1 = id_1,
                 ID2 = id_2,
                 Sample = Sample,
                 Gene = Gene)
dropvec = c()
idvec = c()
for (i in 1:length(df$ID1))
{
  for (j in 1:length(df$ID2))
  {
    if (df$Sample[i] == df$Sample[j] && df$Gene[i] == df$Gene[j] && i != j)
    {
      idvec
      if (df$ID1[i] %in% idvec)
      {
        print(paste("ID ", df$ID1[i], " is in idvec"))
      }
      else {
        print("ID is NOT in idvec")
        idvec = c(idvec, df$ID2[j], df$ID2[i])
        dropvec = c(dropvec, j)
      }
    }
  }
}
I would appreciate any help on this. I updated my code to include the data frame, based on advice. Thanks.
CodePudding user response:
A base R solution:
df[!duplicated(cbind(t(apply(df[1:2], 1, sort)), df[-(1:2)])), ]
# ID1 ID2 Sample Gene
# 1 S01-A01 S25-A01 BA159_0 AIM2
# 2 S01-A02 S25-A02 BA15_0 MMP8
# 3 S01-A17 S25-A17 BA144_0 SERPING1
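To see why the one-liner works, it may help to unpack it into named steps. The sketch below rebuilds the four rows from the question's table and performs the same operations one at a time; the intermediate names (sorted_ids, key, is_dup, result) are mine, chosen only for illustration.

```r
# Sample data copied from the table in the question
df <- data.frame(
  ID1    = c("S01-A01", "S01-A02", "S01-A17", "S25-A01"),
  ID2    = c("S25-A01", "S25-A02", "S25-A17", "S01-A01"),
  Sample = c("BA159_0", "BA15_0", "BA144_0", "BA159_0"),
  Gene   = c("AIM2", "MMP8", "SERPING1", "AIM2")
)

# Step 1: sort the two IDs within each row, so a swapped pair looks identical.
# apply() returns one column per input row, hence the transpose.
sorted_ids <- t(apply(df[1:2], 1, sort))

# Step 2: build a comparison key from the sorted IDs plus the remaining columns
key <- cbind(sorted_ids, df[-(1:2)])

# Step 3: flag every row whose key has already appeared further up
is_dup <- duplicated(key)

# Step 4: keep only the first occurrence of each key, with its original IDs
result <- df[!is_dup, ]
print(result)
```

Because the key is only used for the comparison, the rows that survive keep their IDs exactly as they appeared in the input. On a large table, row-wise apply() can be slow; pmin(df$ID1, df$ID2) and pmax(df$ID1, df$ID2) give the same per-row ordering in a vectorised way.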
CodePudding user response:
How about this, based on the tidyverse...
library(tidyverse)
d <- read.table(textConnection("ID1 ID2 Sample Gene
S01-A01 S25-A01 BA159_0 AIM2
S01-A02 S25-A02 BA15_0 MMP8
S01-A17 S25-A17 BA144_0 SERPING1
S25-A01 S01-A01 BA159_0 AIM2"), header=TRUE)
d %>%
  # Ensure IDs are in a consistent order
  mutate(
    T1 = ifelse(ID1 < ID2, ID1, ID2),
    T2 = ifelse(ID1 < ID2, ID2, ID1)
  ) %>%
  # Remove redundant ID columns
  select(-ID1, -ID2) %>%
  # Ensure uniqueness
  unique() %>%
  # Restore original IDs
  rename(ID1 = T1, ID2 = T2)
#    Sample     Gene     ID1     ID2
# 1 BA159_0     AIM2 S01-A01 S25-A01
# 2  BA15_0     MMP8 S01-A02 S25-A02
# 3 BA144_0 SERPING1 S01-A17 S25-A17
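Note that the pipeline above returns the IDs in sorted order and moves the ID columns to the end. If the original rows and column order should be kept, one possible variant builds an order-independent key with pmin()/pmax() and passes it to distinct() with .keep_all = TRUE. This is only a sketch, assuming dplyr is installed; lo, hi and deduped are hypothetical names introduced for illustration.

```r
library(dplyr)

# Same four rows as in the question's table; strings are kept as character
# so that pmin()/pmax() can compare them
d <- data.frame(
  ID1    = c("S01-A01", "S01-A02", "S01-A17", "S25-A01"),
  ID2    = c("S25-A01", "S25-A02", "S25-A17", "S01-A01"),
  Sample = c("BA159_0", "BA15_0", "BA144_0", "BA159_0"),
  Gene   = c("AIM2", "MMP8", "SERPING1", "AIM2"),
  stringsAsFactors = FALSE
)

deduped <- d %>%
  # lo/hi hold the two IDs in a fixed order, regardless of which column they came from
  mutate(lo = pmin(ID1, ID2), hi = pmax(ID1, ID2)) %>%
  # Keep the first row for each (lo, hi, Sample, Gene) combination, with all columns
  distinct(lo, hi, Sample, Gene, .keep_all = TRUE) %>%
  # Drop the helper columns again
  select(-lo, -hi)

print(deduped)
```

With the 2304-row file described in the question, comparing nrow() before and after is a quick sanity check that 1152 rows remain.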
@gaut's comment about providing a minimal reproducible example is apposite. (Here, we needed a version of your test data provided by dput().) But you caught me on a good day.
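For anyone unsure what sharing data with dput() looks like, here is a minimal sketch. The two-row data frame is made up from the question's table, and small_df and rebuilt are names I chose for illustration.

```r
# A small slice of the data, enough to show the swapped-ID problem
small_df <- data.frame(
  ID1    = c("S01-A01", "S25-A01"),
  ID2    = c("S25-A01", "S01-A01"),
  Sample = c("BA159_0", "BA159_0"),
  Gene   = c("AIM2", "AIM2")
)

# dput() prints R code that recreates the object; paste that output into the question
dput(small_df)

# Anyone answering can then rebuild the data by evaluating the pasted text
rebuilt <- eval(parse(text = paste(capture.output(dput(small_df)), collapse = "\n")))
```

For a large table, dput(head(df, 10)) or a hand-picked subset that still shows the problem is usually enough.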