How to remove duplicates based on each row using strings

Time:12-08

I have data like this:

df<- structure(list(Core = c("Bestman", "Tetra"), member1 = c("Tera1", 
"Brownie1"), member2 = c("Tera2", "Brownie2"), member3 = c("Tera3", 
"Brownie3"), member4 = c("Tera4", "Brownie4"), member5 = c("Tera5", 
"Brownie5"), member6 = c("Brownie2", "Tera2"), member7 = c("Tera1", 
"Tera1"), member8 = c("Tera2", "")), class = "data.frame", row.names = c(NA, 
-2L))

It looks like this:

Core    member1  member2  member3  member4  member5  member6  member7 member8
Bestman Tera1    Tera2    Tera3    Tera4    Tera5    Brownie2 Tera1   Tera2
Tetra   Brownie1 Brownie2 Brownie3 Brownie4 Brownie5 Tera2    Tera1

In the first row, Tera1 and Tera2 each appear twice, so the repeats must be deleted.

In the second row, Brownie2, Tera1 and Tera2 have already appeared in the first row, so they must be deleted as well.

My desired output looks like this:

Core    member1  member2  member3  member4  member5  member6
Bestman Tera1    Tera2    Tera3    Tera4    Tera5    Brownie2
Tetra   Brownie1 Brownie3 Brownie4 Brownie5

    

CodePudding user response:

One way could be with pivoting:

library(dplyr)
library(tidyr)

df %>% 
  pivot_longer(-Core) %>% 
  distinct(value, .keep_all = TRUE) %>% 
  pivot_wider(names_from = name, values_from = value)

-output

  Core    member1  member2 member3  member4  member5  member6  member8
  <chr>   <chr>    <chr>   <chr>    <chr>    <chr>    <chr>    <chr>  
1 Bestman Tera1    Tera2   Tera3    Tera4    Tera5    Brownie2 NA     
2 Tetra   Brownie1 NA      Brownie3 Brownie4 Brownie5 NA
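
Note that member8 survives in the pivoting result because the empty string in the Tetra row counts as a distinct value. A possible variation, offered only as a sketch, converts blanks to NA first and drops them while pivoting. It assumes dplyr 1.0 or later (for across()); the name res is just illustrative.

```r
library(dplyr)
library(tidyr)

df <- structure(list(Core = c("Bestman", "Tetra"),
  member1 = c("Tera1", "Brownie1"), member2 = c("Tera2", "Brownie2"),
  member3 = c("Tera3", "Brownie3"), member4 = c("Tera4", "Brownie4"),
  member5 = c("Tera5", "Brownie5"), member6 = c("Brownie2", "Tera2"),
  member7 = c("Tera1", "Tera1"), member8 = c("Tera2", "")),
  class = "data.frame", row.names = c(NA, -2L))

res <- df %>%
  # treat empty strings as missing so they are not kept as a "value"
  mutate(across(-Core, ~ na_if(.x, ""))) %>%
  # long format, one row per (Core, member) pair; missing values dropped
  pivot_longer(-Core, values_drop_na = TRUE) %>%
  # keep only the first occurrence of each value across the whole data
  distinct(value, .keep_all = TRUE) %>%
  pivot_wider(names_from = name, values_from = value)

res
```

With the sample data this should leave only member1 to member6, with NA where a duplicate was removed (the values are not shifted to the left).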

CodePudding user response:

If every duplicate should become NA, one option is to apply duplicated() to the vector of values from all columns except the first (taken row by row), and then assign NA to those positions in the original data:

df[-1][matrix(duplicated(c(t(df[-1]))), nrow = nrow(df), 
        byrow = TRUE)] <- NA_character_

-output

> df
     Core  member1 member2  member3  member4  member5  member6 member7 member8
1 Bestman    Tera1   Tera2    Tera3    Tera4    Tera5 Brownie2    <NA>    <NA>
2   Tetra Brownie1    <NA> Brownie3 Brownie4 Brownie5     <NA>    <NA>        
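
To see why this works, the one-liner can be unpacked into its intermediate steps. This is only an illustrative breakdown in base R; the names vals, dup and mask are not part of the answer.

```r
df <- structure(list(Core = c("Bestman", "Tetra"),
  member1 = c("Tera1", "Brownie1"), member2 = c("Tera2", "Brownie2"),
  member3 = c("Tera3", "Brownie3"), member4 = c("Tera4", "Brownie4"),
  member5 = c("Tera5", "Brownie5"), member6 = c("Brownie2", "Tera2"),
  member7 = c("Tera1", "Tera1"), member8 = c("Tera2", "")),
  class = "data.frame", row.names = c(NA, -2L))

# t() turns each data row into a matrix column, so c() reads the values
# row by row: all of Bestman's members first, then all of Tetra's
vals <- c(t(df[-1]))

# TRUE for every value that has already been seen earlier in that order
dup <- duplicated(vals)

# fold the flags back into the shape of df[-1], filling row by row
mask <- matrix(dup, nrow = nrow(df), byrow = TRUE)
mask
```

The TRUE cells of mask are exactly the positions that the assignment sets to NA.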

Then we keep only the columns that have at least one non-NA, non-blank value:

df1 <- df[colSums(is.na(df) | df == "") < nrow(df)]
df1

     Core  member1 member2  member3  member4  member5  member6
1 Bestman    Tera1   Tera2    Tera3    Tera4    Tera5 Brownie2
2   Tetra Brownie1    <NA> Brownie3 Brownie4 Brownie5     <NA>
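
The desired output in the question also moves the remaining values to the left within each row, which the result above does not do. A minimal base R sketch of that extra step follows; shift_left, out and res are hypothetical names, and it assumes blanks should be treated like NA.

```r
df <- structure(list(Core = c("Bestman", "Tetra"),
  member1 = c("Tera1", "Brownie1"), member2 = c("Tera2", "Brownie2"),
  member3 = c("Tera3", "Brownie3"), member4 = c("Tera4", "Brownie4"),
  member5 = c("Tera5", "Brownie5"), member6 = c("Brownie2", "Tera2"),
  member7 = c("Tera1", "Tera1"), member8 = c("Tera2", "")),
  class = "data.frame", row.names = c(NA, -2L))

# same masking step as in the answer: duplicates become NA
df[-1][matrix(duplicated(c(t(df[-1]))), nrow = nrow(df),
        byrow = TRUE)] <- NA_character_

# move the non-missing, non-blank values of one row to the front
# and pad the end with NA so the length stays the same
shift_left <- function(x) {
  keep <- unname(x[!is.na(x) & x != ""])
  c(keep, rep(NA_character_, length(x) - length(keep)))
}

out <- df
# apply() returns one column per input row, hence the t()
out[-1] <- t(apply(df[-1], 1, shift_left))

# drop the columns that are now entirely NA
res <- out[colSums(!is.na(out)) > 0]
res
```

On the sample data this should give Core plus member1 to member6, with the Tetra row reading Brownie1, Brownie3, Brownie4, Brownie5 followed by NA.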
Tags: r