Finding distinct values across multiple columns in R

The distinct function in R finds unique values within a single column. However, I would like unique values regardless of which column the value appears in.

The sample data is shown below. 10A appears in the second row under var 1. It appears again in the third row, although it is in var 2 this time.

I would like to remove the entire third row since 10A is a duplicate. How do I do it in R?

Sample data

var 1	var 2
5A	5B
10A	10B
7A	10A
6B	5C

Required Result

var 1	var 2
5A	5B
10A	10B
6B	5C

CodePudding user response：

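If df has the variables var1 and var2 and you want to treat var1 as the set of distinct values, drop every row whose var2 value already occurs in var1:

library(dplyr)  # provides filter()

df |>
  filter(!var2 %in% unique(var1))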
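
For comparison, the same filter can be written in base R without loading a package. This is only a minimal sketch on the question's sample data, assuming the columns are named var1 and var2 (the question writes them as "var 1" and "var 2"); df and result are illustrative names.

```r
# Sample data from the question; the column names var1/var2 are assumed
df <- data.frame(
  var1 = c("5A", "10A", "7A", "6B"),
  var2 = c("5B", "10B", "10A", "5C")
)

# Keep only the rows whose var2 value does not occur anywhere in var1
result <- df[!df$var2 %in% df$var1, ]

print(result)
```

Like the dplyr version, this only compares var2 against var1, so a value that repeats within var1 alone (or within var2 alone) would not be removed.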

CodePudding user response：

The data:

dt <- read.table(header=TRUE, text = "
'var1'  'var2'
'5A'    '5B2'
'10A'   '10B'
'7A'    '10A'
'6B'    '5C'
'10A' '7A'
'2A'  '3B'
'3B'  '99B'")

From your description I assume that you want to determine duplicates by row (not by column). The following creates a logical matrix with the same dimensions as the data frame, in which TRUE marks the first occurrence of a value and FALSE marks a repeat.

dd <- matrix(!duplicated(c(t(dt))), ncol=ncol(dt), byrow=TRUE)

c(t(dt)) creates a vector of the data frame's contents, going through it row by row. !duplicated() flags the first occurrence of each value with TRUE and every later repeat with FALSE. matrix() re-formats the vector so that its dimensions and order match the data frame again. Now we filter dt simply by subsetting with a logical vector. We create that vector by setting FALSE for every row that contains at least one repeated value. This is achieved with the function all, which requires all elements in a row to be TRUE in order to return TRUE.

dt[apply(dd, 1, all),]
#>   var1 var2
#> 1   5A  5B2
#> 2  10A  10B
#> 4   6B   5C
#> 6   2A   3B

Created on 2022-06-01 by the reprex package (v2.0.1)
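
The steps above could be wrapped in a small function and run on the question's original four rows. This is a sketch, not a definitive implementation: the helper name drop_repeated_rows is hypothetical, and it assumes every column holds character values (with numeric columns, t() may reformat the numbers before they are compared).

```r
# Hypothetical helper; the name is illustrative only
drop_repeated_rows <- function(data) {
  values <- c(t(data))                 # flatten the data frame row by row
  first_seen <- !duplicated(values)    # TRUE for the first occurrence of a value
  keep <- matrix(first_seen, ncol = ncol(data), byrow = TRUE)
  # keep a row only if none of its values was seen earlier
  data[apply(keep, 1, all), , drop = FALSE]
}

# Sample data from the question; the column names var1/var2 are assumed
sample_df <- data.frame(
  var1 = c("5A", "10A", "7A", "6B"),
  var2 = c("5B", "10B", "10A", "5C")
)

result <- drop_repeated_rows(sample_df)
print(result)
```

One design consequence to keep in mind: values from a row that is itself dropped still count as seen, so a later row that repeats one of them is dropped as well.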