Set all duplicated cells in a data frame to NA in R

Time:04-21

I have a data frame where several values are duplicated across different rows/columns. I would like to overwrite every cell whose value occurs again somewhere in the df with NA. So far I have tried a for loop:

test1 <- c("a","b")
test2 <- c("d","a")

dft <- data.frame(rbind(test1, test2))

l=0
for(i in dft$X1){
  l <- l + 1
  j=0
  for(k in dft$X2){
    j <- j + 1
    print(i)
    print(k)
    ifelse(k==i, dft$X1[l]<-NA, dft$X1[l] <- i)
    ifelse(k==i, dft$X2[j]<-NA, dft$X2[j]<-k)
  }
}

which yields

       X1   X2
test1 <NA>    b
test2    d <NA>

...perfect... but when I expand my df to

test1 <- c("a","b")
test2 <- c("c","a")
test3 <- c("e","a")
test4 <- c("g","h")

dft <- data.frame(rbind(test1, test2, test3, test4))
> dft
      X1 X2
test1  a  b
test2  c  a
test3  e  a
test4  g  h

and run the same script (to set all "a" NA) it yields

> dft
      X1   X2
test1  a    b
test2  c <NA>
test3  e <NA>
test4  g    h

Why does that happen? And is there an easier way to set all duplicated cells to NA?
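On the "why" part, a minimal trace may help. The sketch below is a simplified stand-in for the nested loop, not the original script: it replays only the writes to dft$X1[1] while the outer value is "a", and the names `x2` and `cell` are hypothetical.

```r
# Replay only the writes to dft$X1[1] for the outer value i = "a".
x2 <- c("b", "a", "a", "h")   # second column of the expanded example
cell <- "a"                   # stands in for dft$X1[1]
for (k in x2) {
  # Same rule as the ifelse() in the loop: NA on a match, otherwise write i back.
  cell <- if (k == "a") NA else "a"
  print(cell)
}
# The final pass compares "h" with "a", finds no match, and restores "a".
```

In the two-row example the match happens on the last inner pass, so nothing overwrites the NA afterwards, which would explain why the smaller case looked correct.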

CodePudding user response:

Use the following script:

na_dups <- function(df){
  idx <- duplicated(unlist(df)) | duplicated(unlist(df), fromLast = TRUE)
  is.na(df) <- array(idx, dim(df))
  df
}
na_dups(dft)
        X1   X2
test1 <NA>    b
test2    c <NA>
test3    e <NA>
test4    g    h
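To see what idx holds, the two duplicated() passes can be inspected on their own. This is only an illustrative sketch: `vals` is typed out by hand to match what unlist() would return for the expanded example (column by column), and `first_pass`, `last_pass` and `mask` are names introduced here.

```r
# Values of the expanded example, flattened column by column.
vals <- c("a", "c", "e", "g", "b", "a", "a", "h")

first_pass <- duplicated(vals)                   # repeats after the first occurrence
last_pass  <- duplicated(vals, fromLast = TRUE)  # repeats before the last occurrence
idx <- first_pass | last_pass                    # every copy of a repeated value

# Reshape to the 4 x 2 layout of dft; TRUE marks the cells that become NA.
mask <- array(idx, c(4, 2))
print(mask)
```

Neither pass alone is enough: each one leaves exactly one copy of a repeated value unflagged, so the two are combined with `|`.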

Another approach:

na_dups <- function(df){
  idx <- (table(unlist(df)) > 1)[unlist(df)]
  is.na(df) <- array(idx, dim(df))
  df
}
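A step-by-step run of this second approach might look like the sketch below. It rebuilds the example with plain character columns and without the test1..test4 row names, and it assumes the data frame contains no missing values to begin with; `counts`, `lookup` and `res` are names introduced for illustration.

```r
dft <- data.frame(X1 = c("a", "c", "e", "g"),
                  X2 = c("b", "a", "a", "h"),
                  stringsAsFactors = FALSE)

counts <- table(unlist(dft))          # occurrences of each distinct value
lookup <- (counts > 1)[unlist(dft)]   # look up each cell's value by name
res <- dft
is.na(res) <- array(lookup, dim(res)) # blank out cells whose value repeats
print(res)
```

The table is indexed by name, so each cell picks up the TRUE/FALSE flag of its own value regardless of the order in which table() sorts the values.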

CodePudding user response:

You could count how many times each value appears, take the names of those that appear more than once, and then assign NA to those values.

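tab <- table(unlist(dft))    # count every value across all columns
dup <- names(tab)[tab > 1]   # values that occur more than once
dft[array(unlist(dft) %in% dup, dim(dft))] <- NA   # cell-wise mask; %in% on a data frame compares whole columns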

        X1   X2
test1 <NA>    b
test2    c <NA>
test3    e <NA>
test4    g    h