Identify groups of identical rows in a matrix

tl;dr What is the idiomatic way to identify groups of identical rows in a matrix in R?

Given an n-by-2 matrix where some rows occur more than once,

> mat <- matrix(c(2,5,5,3,4,6,2,5,4,6,4,6), ncol=2, byrow=TRUE)
> mat
     [,1] [,2]
[1,]    2    5
[2,]    5    3
[3,]    4    6
[4,]    2    5
[5,]    4    6
[6,]    4    6

I want the groups of row indices of identical rows. In the example above, rows (1,4) are identical, as are rows (3,5,6), and row (2) is on its own. These groups can be represented in whatever way is idiomatic in R.

The output could be something like this,

> groups <- matrix(c(1,1, 2,2, 3,3, 4,1, 5,3, 6,3), ncol=2, byrow=TRUE)
> groups
     [,1] [,2]
[1,]    1    1
[2,]    2    2
[3,]    3    3
[4,]    4    1
[5,]    5    3
[6,]    6    3

where the first column contains the row indices of mat and the second the group index for each row index. Or it could be like this:

> split(groups[,1], groups[,2])
$`1`
[1] 1 4

$`2`
[1] 2

$`3`
[1] 3 5 6

Either will do. I am not sure what the best way to represent groups in R is, so advice on this is also welcome.

For benchmarking purposes, here's a larger dataset:

set.seed(123)
n <- 10000000
mat <- matrix(sample.int(10, 2*n, replace = TRUE), ncol=2)

User response:

Paste each row into a single key, then cbind the sequence of row indices with the match of each key against the unique keys:

v1 <- paste(mat[,1], mat[,2])
# or if there are more columns
#v1 <-  do.call(paste, as.data.frame(mat))
out <- cbind(seq_len(nrow(mat)), match(v1, unique(v1)))

-output

> out
     [,1] [,2]
[1,]    1    1
[2,]    2    2
[3,]    3    3
[4,]    4    1
[5,]    5    3
[6,]    6    3
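
For a matrix with more than two columns, the same idea can be wrapped in a small helper. The sketch below is only an illustration and not part of the original answer: `row_groups` and `mat3` are hypothetical names, and it assumes that no cell value contains the separator string (otherwise two distinct rows could be pasted into the same key).

```r
# Hypothetical helper: group index for each row of a matrix with any number of columns.
# Assumes no cell value contains `sep`; otherwise distinct rows could share a key.
row_groups <- function(m, sep = "\t") {
  # Paste the columns together row-wise to get one key per row
  key <- do.call(paste, c(as.data.frame(m), sep = sep))
  # Groups are numbered in order of first appearance
  match(key, unique(key))
}

# Small made-up example with three columns
mat3 <- matrix(c(2,5,1, 5,3,1, 2,5,1, 2,5,2), ncol = 3, byrow = TRUE)
grp <- row_groups(mat3)
cbind(seq_len(nrow(mat3)), grp)
```

Here rows 1 and 3 of `mat3` are identical, so they end up with the same group index, while rows 2 and 4 each get their own.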

If we want a list output

split(out[,1], out[,2])

-output

$`1`
[1] 1 4

$`2`
[1] 2

$`3`
[1] 3 5 6
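
The list form is also convenient for summarising the groups. A minimal sketch, using the small example matrix from the question (stored here under the assumed name `m` so that it does not overwrite `mat`), that counts the rows in each group and keeps only the groups with repeated rows:

```r
m <- matrix(c(2,5,5,3,4,6,2,5,4,6,4,6), ncol = 2, byrow = TRUE)
key <- paste(m[, 1], m[, 2])
grp <- match(key, unique(key))

# Row indices per group, equivalent to split(out[,1], out[,2])
groups <- split(seq_along(grp), grp)

# Number of rows in each group
sizes <- lengths(groups)

# Keep only the groups that contain more than one row
dups <- groups[sizes > 1]
```

With this input, `dups` retains groups 1 and 3 and drops the singleton group 2.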

Benchmarks

With the OP's larger dataset:

> system.time({
  v1 <- paste(mat[,1], mat[,2])
  
  out <- cbind(seq_len(nrow(mat)), match(v1, unique(v1)))
  
  })
   user  system elapsed 
  2.603   0.130   2.706
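
When every entry is a small non-negative integer, as in the benchmark data, one possible alternative is to build a numeric key arithmetically instead of pasting strings. This is a sketch of a different technique from the one timed above, not a measured improvement. It assumes exactly two columns of non-negative whole numbers whose combined key is small enough to be represented exactly as a double; `m`, `base`, `out_num` and `out_chr` are names introduced here for illustration.

```r
set.seed(123)
n_small <- 1000  # much smaller than the OP's n, just for illustration
m <- matrix(sample.int(10, 2 * n_small, replace = TRUE), ncol = 2)

# Encode each row (a, b) as a * base + b, with base larger than any b,
# so that distinct rows always get distinct keys
base <- max(m[, 2]) + 1
key <- m[, 1] * base + m[, 2]
out_num <- cbind(seq_len(nrow(m)), match(key, unique(key)))

# Check against the paste-based grouping from the answer
v1 <- paste(m[, 1], m[, 2])
out_chr <- cbind(seq_len(nrow(m)), match(v1, unique(v1)))
identical(out_num, out_chr)
```

Because both keys distinguish exactly the same rows and `match` numbers groups in order of first appearance, the two results should agree. Whether the numeric key is actually faster would need to be timed on the full dataset.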