Set all duplicated cells in a data frame to NA in R

I have a data frame in which several values are duplicated across different rows and columns. I would like to overwrite every cell whose value occurs again anywhere in the data frame with NA. So far I have tried a for loop:

test1 <- c("a","b")
test2 <- c("d","a")

dft <- data.frame(rbind(test1, test2))

l=0
for(i in dft$X1){
  l <- l + 1
  j=0
  for(k in dft$X2){
    j <- j + 1
    print(i)
    print(k)
    ifelse(k==i, dft$X1[l]<-NA, dft$X1[l] <- i)
    ifelse(k==i, dft$X2[j]<-NA, dft$X2[j]<-k)
  }
}

which yields

       X1   X2
test1 <NA>    b
test2    d <NA>

...perfect... but when I expand my df to

test1 <- c("a","b")
test2 <- c("c","a")
test3 <- c("e","a")
test4 <- c("g","h")

dft <- data.frame(rbind(test1, test2, test3, test4))
> dft
      X1 X2
test1  a  b
test2  c  a
test3  e  a
test4  g  h

and run the same script (to set all "a" NA) it yields

> dft
      X1   X2
test1  a    b
test2  c <NA>
test3  e <NA>
test4  g    h

Why does that happen? And is there an easier way to set all duplicated cells to NA?

CodePudding user response：

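You can flag every value that occurs more than once with duplicated(), scanning from both ends, and then set the flagged cells to NA: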

na_dups <- function(df){
  idx <- duplicated(unlist(df)) | duplicated(unlist(df), fromLast = TRUE)
  is.na(df) <- array(idx, dim(df))
  df
}
na_dups(dft)
        X1   X2
test1 <NA>    b
test2    c <NA>
test3    e <NA>
test4    g    h
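
To see why both calls to duplicated() are needed, here is a minimal sketch on a plain vector. The vector x is an assumption made for illustration: it holds the cell values of the expanded dft from the question, listed column by column, which is the order unlist() would produce.

```r
# Cell values of the expanded example, column by column (illustrative input).
x <- c("a", "c", "e", "g", "b", "a", "a", "h")

fwd <- duplicated(x)                   # TRUE for every repeat except the first occurrence
bwd <- duplicated(x, fromLast = TRUE)  # TRUE for every repeat except the last occurrence
idx <- fwd | bwd                       # TRUE for all occurrences of a repeated value

which(idx)   # positions of every "a"
x[idx] <- NA
x
```

Using only duplicated(x) would leave the first "a" in place, which is why the scan is combined with a second one from the other end.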

Another approach uses table() to count how often each value occurs:

na_dups <- function(df){
  idx <- (table(unlist(df)) > 1)[unlist(df)]
  is.na(df) <- array(idx, dim(df))
  df
}
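
The lookup step in this version may be easier to follow on a plain vector. The sketch below assumes character values and uses an illustrative vector x (the cells of the expanded dft, column by column); the names counts and repeated are introduced here only for the example.

```r
# Illustrative input: the cells of the expanded example, column by column.
x <- c("a", "c", "e", "g", "b", "a", "a", "h")

counts   <- table(x)     # one count per distinct value, named by that value
repeated <- counts > 1   # named logical: TRUE only for values seen more than once

# Index the named logical by the values themselves, one lookup per cell,
# then drop the table attributes to get a plain logical vector.
idx <- as.vector(repeated[x])
which(idx)
```

The result has one entry per cell, so it can be reshaped to the dimensions of the data frame as in the function above.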

CodePudding user response：

You could count how many times each value appears, take the names of the values that appear more than once, and then assign NA to every cell holding one of those values.

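counts <- table(unlist(dft))
# names() keeps values and counts aligned; unique() and table() order values differently
dup <- names(counts)[counts > 1]
# %in% must be applied per column; on a whole data frame it compares entire columns
dft[] <- lapply(dft, function(col) replace(col, col %in% dup, NA))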

        X1   X2
test1 <NA>    b
test2    c <NA>
test3    e <NA>
test4    g    h
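
As for why the loop in the question leaves the first "a" untouched: one plausible reading is that the else branch of ifelse() writes the original value back on every non-matching comparison, so a later non-match undoes an earlier NA. The sketch below is a simplified trace of that pattern for a single cell; trace_cell is a hypothetical helper written for illustration, not part of the original code.

```r
# Simplified trace of the question's inner loop for one cell of X1.
# trace_cell is a hypothetical helper used only for this illustration.
trace_cell <- function(value, others) {
  cell <- value
  for (k in others) {
    # Same pattern as the question: NA on a match, otherwise restore the value.
    if (k == value) cell <- NA else cell <- value
  }
  cell
}

trace_cell("a", c("b", "a"))            # last comparison is a match
trace_cell("a", c("b", "a", "a", "h"))  # last comparison ("h") is a non-match
```

In the two-row example the last comparison happens to be a match, so the NA survives; in the four-row example the final comparison against "h" restores "a".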