Drop duplicates and keep first in R data.table


I am not familiar with R, so apologies if this has been asked before; I could not find an existing answer.

Suppose I have a network of IPs with edge data of this type:

library(data.table)
toy_data = data.table(from=c("A","B","A","C","D","C"), to=c("B","A","C","B","A","A"))

   from to
1:    A  B
2:    B  A
3:    A  C
4:    C  B
5:    D  A
6:    C  A

I cannot load the whole network into igraph, so I am trying to compute statistics on chunks. Given that the network is undirected, I would like to drop every row whose from-to pair is the reverse of an earlier row (rows 2 and 6 here).

I originally thought that something like unique(toy_data[,.(c(from,to)|c(to,from))]) would work, but unfortunately it does not.

I then thought of using two auxiliary columns:

toy_data[,orig:=paste(from,to,sep="")]
toy_data[,reverse:=paste(to,from,sep="")]

and then work with something like unique(toy_data[, .(?)]),

but my guess is that there is a much easier way than what I am doing.

CodePudding user response:

Instead of creating temporary columns, paste the row-wise minimum (pmin) with the row-wise maximum (pmax) and drop the duplicates with duplicated, negated with !:

toy_data[!duplicated(paste(pmin(from, to), pmax(from, to)))]

-output

    from     to
   <char> <char>
1:      A      B
2:      A      C
3:      C      B
4:      D      A
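
For completeness, here is a variant closer to the auxiliary-column idea from the question, as a minimal sketch: the helper column name key and the "_" separator are illustrative choices, not required names.

toy_data[, key := paste(pmin(from, to), pmax(from, to), sep = "_")]  # canonical (sorted) pair per row
unique(toy_data, by = "key")[, key := NULL][]                        # keep first occurrence, drop the helper

Using an explicit separator (or paste's default space, as in the one-liner above) avoids accidental collisions when the vertex labels are longer than one character, which matters for real IP addresses; with sep="" two different pairs could be glued into the same string.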