Here is my data:
mymat <- structure(c(3, 6, 9, 9, 1, 4, 1, 5, 9, 6, 6, 4, 1, 4), .Dim = c(7L, 2L))
Some rows are duplicated, and several other rows contain the same elements in a different order. I want to remove all rows that contain the same elements, whether those elements are in the same order (duplicated rows) or a different order. This will retain only the first row, c(3, 5).
I checked previous questions here and here. However, my requirement is that all such rows are removed rather than leaving one such row.
My question also differs from this one, which removes all duplicated rows: I am looking for rows that are not just duplicated, but also rows that contain the same set of elements in a different order. For example, rows c(6, 9) and c(9, 6) should both be removed since they both contain the same set of elements.
I am looking for solutions that do not use a for loop, since my real data is large and a for loop may be slow.
Note: My full data has 40k rows and 2 columns.
CodePudding user response:
You can sort the data row-wise and use duplicated:
tmp <- t(apply(mymat, 1, sort))
tmp[!(duplicated(tmp) | duplicated(tmp, fromLast = TRUE)), , drop = FALSE]
#      [,1] [,2]
# [1,]    3    5
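Since the data has exactly two columns, the row-wise sort above can also be done without apply(). The following is only a sketch under that two-column assumption: it uses pmin() and pmax() to build the sorted pairs, and the names lo, hi, key, in_group and result are made up for illustration. It indexes mymat rather than the sorted copy, so surviving rows keep their original element order.

```r
# Sample data from the question
mymat <- structure(c(3, 6, 9, 9, 1, 4, 1, 5, 9, 6, 6, 4, 1, 4), .Dim = c(7L, 2L))

# Row-wise sort for exactly two columns, without apply():
# smaller element first, larger element second
lo <- pmin(mymat[, 1], mymat[, 2])
hi <- pmax(mymat[, 1], mymat[, 2])
key <- cbind(lo, hi)

# Flag every member of a repeated group, not only the later copies
in_group <- duplicated(key) | duplicated(key, fromLast = TRUE)

# Index the original matrix so kept rows are returned unsorted
result <- mymat[!in_group, , drop = FALSE]
result
```

On a matrix with 40k rows this avoids calling sort() once per row, which may matter for speed, though I have not benchmarked it.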
CodePudding user response:
I added a little data to show that the result stays a matrix:
mymat <- structure(c(3, 6, 9, 9, 1, 4, 1, 10, 12, 13, 14, 5, 9, 6, 6, 4, 1, 4, 11, 13, 12, 15), .Dim = c(11L, 2L))
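# Stack the matrix on top of a copy with its columns swapped
stacked <- rbind(mymat, mymat[, c(2, 1)])
dup <- duplicated(stacked)
dup_fromLast <- duplicated(stacked, fromLast = TRUE)
# Keep only the flags for the original rows (the first half of the stack)
mymat_duprm <- mymat[!(dup_fromLast | dup)[seq_len(nrow(mymat))], , drop = FALSE]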
mymat_duprm
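One case that may be worth checking with the swapped-column approach is a row whose two elements are equal, because such a row matches its own swapped copy. The sketch below uses a made-up two-row matrix m (not from the question) and compares the flags against a sorted-key version; keep_swap and keep_sort are illustrative names.

```r
# Hypothetical example: both rows are unique, but c(2, 2) equals its own swap
m <- rbind(c(2, 2), c(3, 5))

# Swapped-column approach: stack m on top of its column-swapped copy
stacked <- rbind(m, m[, c(2, 1)])
flagged <- duplicated(stacked) | duplicated(stacked, fromLast = TRUE)
keep_swap <- !flagged[seq_len(nrow(m))]

# Sorted-key approach for comparison
key <- cbind(pmin(m[, 1], m[, 2]), pmax(m[, 1], m[, 2]))
keep_sort <- !(duplicated(key) | duplicated(key, fromLast = TRUE))

keep_swap  # c(2, 2) is flagged for removal here
keep_sort  # both rows are kept here
```

If the real data never has equal elements within a row, the two approaches should agree.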
CodePudding user response:
As a matrix:
tmp <- apply(mymat, 1, function(z) toString(sort(z)))
mymat[ave(tmp, tmp, FUN = length) == "1",, drop = FALSE]
#      [,1] [,2]
# [1,]    3    5
The drop = FALSE is required only because (at least with this sample data) the filtering results in one row; without it, a one-row result is simplified to a plain vector. While I doubt your real data (with 40k rows) would reduce to this, I recommend keeping it anyway as defensive programming.
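To see what drop = FALSE protects against, here is a small sketch on the question's sample data that runs the same filter with and without it. The names keep, with_drop and without_drop are just for illustration.

```r
# Sample data from the question
mymat <- structure(c(3, 6, 9, 9, 1, 4, 1, 5, 9, 6, 6, 4, 1, 4), .Dim = c(7L, 2L))

# One string key per row, built from the sorted elements
tmp <- apply(mymat, 1, function(z) toString(sort(z)))
# ave() returns the group sizes as character here, because tmp is character
keep <- ave(tmp, tmp, FUN = length) == "1"

with_drop <- mymat[keep, , drop = FALSE]
without_drop <- mymat[keep, ]

is.matrix(with_drop)     # TRUE: still a 1 x 2 matrix
is.matrix(without_drop)  # FALSE: simplified to a numeric vector
```

Code further down that calls nrow() or indexes by column would behave differently on the vector, which is why keeping the argument is the safer default.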
CodePudding user response:
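For the sample data, you can use the following line of code. Note that it only checks whether a row's first element appears anywhere in the second column, so it is not a general test for rows that share the same set of elements: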
mymat <- mymat[!mymat[,1] %in% mymat[,2], , drop = FALSE]
Output:
mymat
#>      [,1] [,2]
#> [1,]    3    5
Created on 2021-09-24 by the reprex package (v0.3.0)
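The %in% filter gives the expected result on the sample data, but it compares column 1 against all of column 2 rather than comparing rows as sets. The sketch below uses a made-up matrix m (not from the question) in which both rows are unique, and compares it with a sorted-key filter; by_in and by_key are illustrative names.

```r
# Hypothetical example: two unique rows that happen to share the value 5
m <- rbind(c(3, 5), c(5, 7))

# The %in% filter drops c(5, 7) because 5 also occurs in column 2
by_in <- m[!m[, 1] %in% m[, 2], , drop = FALSE]

# A sorted-key filter compares whole rows as unordered pairs
key <- cbind(pmin(m[, 1], m[, 2]), pmax(m[, 1], m[, 2]))
by_key <- m[!(duplicated(key) | duplicated(key, fromLast = TRUE)), , drop = FALSE]

nrow(by_in)   # 1
nrow(by_key)  # 2
```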