The distinct function in R returns unique values within a single column. However, I would like the values to be unique regardless of which column they appear in.
The sample data is shown below. 10A appears in the second row under var 1. It appears again in the third row, although it is in var 2 this time.
I would like to remove the entire third row since 10A is a duplicate. How do I do it in R?
Sample data
var 1 | var 2 |
---|---|
5A | 5B |
10A | 10B |
7A | 10A |
6B | 5C |
Required Result
var 1 | var 2 |
---|---|
5A | 5B |
10A | 10B |
6B | 5C |
CodePudding user response:
If df has the variables var1 and var2, and you want to keep only the rows whose var2 value does not appear anywhere in var1:
library(dplyr)  # provides filter(); the native pipe |> needs R >= 4.1
df |>
  filter(!var2 %in% unique(var1))
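For anyone without dplyr, here is a rough base R sketch of the same idea, run on the sample data from the question. The data frame name `df` and the column names `var1`/`var2` (without spaces) are assumptions made for illustration.

```r
# Sample data from the question (column names written without spaces)
df <- data.frame(
  var1 = c("5A", "10A", "7A", "6B"),
  var2 = c("5B", "10B", "10A", "5C")
)

# Keep rows whose var2 value does not occur anywhere in var1
result <- df[!df$var2 %in% df$var1, ]
print(result)
```

Note that this only compares var2 against var1. It does not catch a value that repeats within var1 itself, or a var1 value that appeared earlier in var2, so it may not be enough for data with those patterns.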
CodePudding user response:
The data:
dt <- read.table(header=TRUE, text = "
'var1' 'var2'
'5A' '5B2'
'10A' '10B'
'7A' '10A'
'6B' '5C'
'10A' '7A'
'2A' '3B'
'3B' '99B'")
From your description I assume that you want to determine duplicates by
row (not by column).
This creates a matrix that has the same dimensions as the data frame and marks unique values with TRUE.
dd <- matrix(!duplicated(c(t(dt))), ncol=ncol(dt), byrow=TRUE)
c(t(dt)) creates a vector of the data frame's contents, going through it row by row. !duplicated() marks the first occurrence of each value with TRUE and every later repeat with FALSE. matrix() then reshapes the vector so that its dimensions and order match the data frame again.
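To make the intermediate steps easier to follow, here is a small sketch on a three-row example. The names `small`, `flat` and `first_seen` are made up for illustration and are not part of the answer's code.

```r
# Hypothetical three-row example, separate from dt above
small <- data.frame(
  var1 = c("5A", "10A", "7A"),
  var2 = c("5B", "10B", "10A")
)

# Flatten row by row: t() transposes, c() drops the dimensions
flat <- c(t(small))
print(flat)

# TRUE for the first time a value is seen, FALSE for later repeats
first_seen <- !duplicated(flat)
print(first_seen)

# Back to one row per data frame row, one column per variable
dd_small <- matrix(first_seen, ncol = ncol(small), byrow = TRUE)
print(dd_small)
```

Only the last cell is FALSE here, because the second "10A" (row 3, var2) is the only repeat.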
Now we filter dt simply by subsetting with a logical vector. We create that vector by setting FALSE for every row that contains at least one duplicate. This is done with the function all, which requires all elements in a row to be TRUE in order to return TRUE.
dt[apply(dd, 1, all),]
#>   var1 var2
#> 1   5A  5B2
#> 2  10A  10B
#> 4   6B   5C
#> 6   2A   3B
Created on 2022-06-01 by the reprex package (v2.0.1)
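If you need this more than once, the steps could be wrapped in a small helper. This is only a sketch: the function name `drop_rows_with_repeats` is made up, and it assumes every column is character, since t() coerces a mixed-type data frame to a single type and the comparison may then not behave as expected.

```r
# Hypothetical helper wrapping the steps above; assumes all columns are character
drop_rows_with_repeats <- function(data) {
  flat <- c(t(data))                       # values in row-by-row order
  first_seen <- matrix(!duplicated(flat),  # TRUE = first occurrence
                       ncol = ncol(data), byrow = TRUE)
  keep <- apply(first_seen, 1, all)        # row is kept only if all cells are TRUE
  data[keep, , drop = FALSE]
}

# Sample data from the question
question_df <- data.frame(
  var1 = c("5A", "10A", "7A", "6B"),
  var2 = c("5B", "10B", "10A", "5C")
)

cleaned <- drop_rows_with_repeats(question_df)
print(cleaned)
```

On the question's sample data this drops the third row and keeps rows 1, 2 and 4. The `drop = FALSE` is there so the result stays a data frame even if the input has a single column.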