In R, compare two columns in different dataframes and delete duplicates

This might be quite basic, but I need to compare two columns of strings in two different data frames and then delete every entry that appears in both, so that I can work with the remaining elements.

Currently, I have the following:

compare <- append(test1$TL,test2$tl)
compare <- compare[!duplicated(compare)]

But this only removes one copy of each duplicated element and keeps the other. I need to remove both copies so that only the non-duplicated elements remain. Can anyone help?

CodePudding user response：

You may try the following, which takes the union of both vectors and then drops the elements they share:

x <- c(1,2,3)
y <- c(3,4,5)


z <- union(x,y)
z[! z %in% intersect(x,y)]
#[1] 1 2 4 5
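
Applied to the data frames from the question, the same idea might look like the sketch below. The contents of test1 and test2 are made up for illustration; only the column names TL and tl come from the question.

```r
# Hypothetical stand-ins for the asker's data frames (contents are invented)
test1 <- data.frame(TL = c("a", "b", "c"), stringsAsFactors = FALSE)
test2 <- data.frame(tl = c("c", "d", "e"), stringsAsFactors = FALSE)

# Values that occur in only one of the two columns
z <- union(test1$TL, test2$tl)
only_once <- z[!z %in% intersect(test1$TL, test2$tl)]
only_once
#[1] "a" "b" "d" "e"

# To keep whole rows instead of bare values, filter each data frame
# against the other column; drop = FALSE keeps the result a data frame
test1_rest <- test1[!test1$TL %in% test2$tl, , drop = FALSE]
test2_rest <- test2[!test2$tl %in% test1$TL, , drop = FALSE]
```

Note that union() also collapses values repeated within a single column, so each surviving value appears once in only_once.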

CodePudding user response：

Using %in%, or duplicated() scanned from both ends so that every copy of a repeated value is flagged:

x <- 1:3
y <- 3:5

c(x[!x %in% y], y[!y %in% x])
#[1] 1 2 4 5

. <- c(x, y)
.[!(duplicated(.) | duplicated(., fromLast = TRUE))]
#[1] 1 2 4 5
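
The approaches agree on the example above, but they can differ when a value is repeated within one vector and absent from the other. A small sketch with made-up input (the names res_in, res_dup, res_ui and xy are only for illustration):

```r
x <- c(1, 1, 2, 3)   # 1 is repeated within x but does not occur in y
y <- c(3, 4, 5)

# %in% only drops values shared between x and y, so both copies of 1 survive
res_in <- c(x[!x %in% y], y[!y %in% x])
res_in
#[1] 1 1 2 4 5

# duplicated() drops every value that occurs more than once overall, so 1 goes too
xy <- c(x, y)
res_dup <- xy[!(duplicated(xy) | duplicated(xy, fromLast = TRUE))]
res_dup
#[1] 2 4 5

# union()/intersect() collapse the repeated 1 into a single copy
z <- union(x, y)
res_ui <- z[!z %in% intersect(x, y)]
res_ui
#[1] 1 2 4 5
```

Which result is wanted depends on whether a value repeated inside a single column should count as a duplicate.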

Benchmark (on three-element vectors, so the timings mostly reflect call overhead):

x <- 1:3
y <- 3:5

bench::mark("UniInter" = {z <- union(x,y); z[! z %in% intersect(x,y)]},
            "%in%" = c(x[!x %in% y], y[!y %in% x]),
            "dupli" = {. <- c(x, y); .[!(duplicated(.) | duplicated(., fromLast = TRUE))]})
#  expression     min  median `itr/sec` mem_alloc `gc/sec` n_itr  n_gc total_time
#  <bch:expr> <bch:t> <bch:t>     <dbl> <bch:byt>    <dbl> <int> <dbl>   <bch:tm>
#1 UniInter    8.15µs    10µs    78539.        0B     39.3  9995     5    127.3ms
#2 %in%        1.68µs  1.99µs   418373.        0B     41.8  9999     1     23.9ms
#3 dupli       4.35µs  5.13µs   178003.        0B     17.8  9999     1     56.2ms
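
To see how the three variants compare on inputs closer to real columns, the benchmark could be rerun on larger vectors. The sketch below assumes the bench package is installed; the sizes, the seed, and the names pool, xy and res are arbitrary choices for illustration, and no timings are shown because they depend on the machine.

```r
set.seed(42)
pool <- sample(2e5)        # a random permutation, so no value repeats
x <- pool[1:100000]
y <- pool[50001:150000]    # shares 50000 values with x

# Neither vector has internal repeats, so all three variants return the
# same result and pass bench::mark's default equality check
if (requireNamespace("bench", quietly = TRUE)) {
  res <- bench::mark(
    "UniInter" = {z <- union(x, y); z[!z %in% intersect(x, y)]},
    "%in%"     = c(x[!x %in% y], y[!y %in% x]),
    "dupli"    = {xy <- c(x, y); xy[!(duplicated(xy) | duplicated(xy, fromLast = TRUE))]}
  )
  print(res)
}
```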