I have a data frame in which several values are duplicated across different rows and columns. I would like to overwrite every cell whose value occurs again anywhere in the data frame with NA. So far I have tried a for loop:
test1 <- c("a","b")
test2 <- c("d","a")
dft <- data.frame(rbind(test1, test2))
l <- 0
for (i in dft$X1) {
  l <- l + 1
  j <- 0
  for (k in dft$X2) {
    j <- j + 1
    print(i)
    print(k)
    ifelse(k == i, dft$X1[l] <- NA, dft$X1[l] <- i)
    ifelse(k == i, dft$X2[j] <- NA, dft$X2[j] <- k)
  }
}
which yields
        X1   X2
test1 <NA>    b
test2    d <NA>
...perfect... but when I expand my df to
test1 <- c("a","b")
test2 <- c("c","a")
test3 <- c("e","a")
test4 <- c("g","h")
dft <- data.frame(rbind(test1, test2, test3, test4))
> dft
      X1 X2
test1  a  b
test2  c  a
test3  e  a
test4  g  h
and run the same script (to set every "a" to NA), it yields
> dft
      X1   X2
test1  a    b
test2  c <NA>
test3  e <NA>
test4  g    h
Why does that happen? And is there an easier way to set all duplicated cells to NA?
CodePudding user response:
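You can flag every value that occurs more than once by running duplicated() over the flattened data frame in both directions, and then set the flagged cells to NA:
na_dups <- function(df){
  # TRUE for every value that appears more than once anywhere in df
  vals <- unlist(df)
  idx <- duplicated(vals) | duplicated(vals, fromLast = TRUE)
  # reshape the flags to df's dimensions and set those cells to NA
  is.na(df) <- array(idx, dim(df))
  df
}
na_dups(dft)
        X1   X2
test1 <NA>    b
test2    c <NA>
test3    e <NA>
test4    g    h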
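To see why both duplicated() calls are needed, here is a small sketch on a plain character vector. The vector x is assumed to hold the values of the four-row example in the column-by-column order that unlist() would produce.

```r
# Values of the four-row example, flattened column by column (assumed order)
x <- c("a", "c", "e", "g", "b", "a", "a", "h")

fwd <- duplicated(x)                   # flags the 2nd and later "a" (positions 6, 7)
bwd <- duplicated(x, fromLast = TRUE)  # flags all but the last "a" (positions 1, 6)

# Either pass alone misses one occurrence; the union catches all three
idx <- fwd | bwd
which(idx)  # 1 6 7
```

duplicated() on its own never flags the first occurrence it meets, which is why the forward pass alone would leave one "a" in place.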
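Another approach, which counts each value with table() and looks the counts up by name:
na_dups <- function(df){
  # convert to character so the lookup below is by name, not by position
  vals <- as.character(unlist(df))
  idx <- (table(vals) > 1)[vals]
  is.na(df) <- array(idx, dim(df))
  df
}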
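As for why the loop breaks on the larger example: the inner loop rewrites dft$X1[l] on every comparison, so a later non-matching k writes i back after an earlier match had already set the cell to NA. In the two-row example the match happened to be the last comparison, so the NA survived. The sketch below replays only the writes to one X1 cell. The helper name trace_cell is made up for illustration, and it works on plain character vectors rather than on the data frame itself.

```r
# Hypothetical helper: replay what the inner loop does to a single X1 cell
trace_cell <- function(i, x2) {
  cell <- i
  for (k in x2) {
    # Same logic as the ifelse() in the question: a match writes NA,
    # a non-match writes the original value back
    cell <- if (k == i) NA_character_ else i
  }
  cell
}

small <- trace_cell("a", c("b", "a"))            # match comes last: NA survives
large <- trace_cell("a", c("b", "a", "a", "h"))  # "h" comes last: "a" is restored
```

So the final state of each X1 cell depends only on the last value compared against it, not on whether a match occurred at all.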
CodePudding user response:
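You could count how many times each value appears, take the names of the values that appear more than once, and then assign NA to the cells holding those values.
tab <- table(unlist(dft))
dup <- names(tab)[tab > 1]
# test the flattened values, then reshape the result to the data frame's dimensions
dft[array(unlist(dft) %in% dup, dim(dft))] <- NA
dft
        X1   X2
test1 <NA>    b
test2    c <NA>
test3    e <NA>
test4    g    h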