Which rows/columns are duplicates of which others in R matrices?


I have a large matrix of this nature:

x <- matrix(c(1, 1, 3, 3, 55, 55, 1, 3, 3, 1,
              1, 1, 3, 3, 55, 55, 1, 3, 9, 1), ncol = 2)

It's easy to delete duplicate rows using

x[!duplicated(x, MARGIN = 1), ]

I wish to identify, for each row, the index of the first occurrence of that row in the original matrix; so here we'd have the vector

1 1 3 3 5 5 1 3 9 1

The best I've come up with is a convoluted and roundabout approach using for loops that is neither efficient nor elegant. I'm also aware of possible solutions for data.frames; those involving concatenating rows into strings are quite resource-intensive too.

Is there an elegant solution using base R?


Ugly solution

# Identify duplicates
duplicate <- duplicated(x, MARGIN = 1)

# Identify first occurrence of each duplicate
firstDup <- duplicated(x, MARGIN = 1, fromLast = TRUE) & !duplicate
indices <- which(firstDup)

# Initialize index for unique rows
index <- seq_len(dim(x)[1])

cf <- duplicate
for (i in indices) {
  # Duplicates must occur after first occurrence
  cf[seq_len(i)] <- FALSE
  dups <- apply(x[cf, , drop = FALSE], 1L, identical, x[i, ])
  index[which(cf)[dups]] <- i
}
index
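# [1] 1 1 3 3 5 5 1 3 9 1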

CodePudding user response:

How about this?

l <- asplit(x, 1L)
match(l, l)
 [1] 1 1 3 3 5 5 1 3 9 1

Here, we are using asplit to obtain a list l of the rows of x and match to obtain the index of the first occurrence of each row.
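
If, in addition, you want each row's position in the deduplicated matrix x[!duplicated(x, MARGIN = 1), ] rather than in x itself, one possible follow-up (a small sketch building on the same idea) is to match the first-occurrence indices against their unique values:

idx <- match(l, l)          # first-occurrence index of each row in x
match(idx, unique(idx))     # corresponding row number in the deduplicated matrix
 [1] 1 1 2 2 3 3 1 2 4 1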

CodePudding user response:

If you have a large matrix, then the following solution might suffice:

l <- do.call(paste, data.frame(x))
match(l, l)
[1] 1 1 3 3 5 5 1 3 9 1
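
A possible caveat with this approach: for character data the default space separator can in principle make different rows collapse to the same string if the entries themselves contain spaces. A sketch using an explicit separator that is assumed not to occur in the data:

l <- do.call(paste, c(data.frame(x), sep = "\r"))  # "\r" assumed absent from the data
match(l, l)
[1] 1 1 3 3 5 5 1 3 9 1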

CodePudding user response:

I thought @Mikael Jagan indeed provided "an elegant solution using base R", using asplit and match(<list>, <list>). However, as noted in the match documentation, "Matching for lists is potentially very slow and best avoided except in simple cases."

Because I had never seen match(<list>, <list>) used on larger data, I got a bit curious and decided to do some timings, mainly to compare it with the other nice base answers: the one by @Onyambu based on paste and the one by @ThomasIsCoding based on ave. In addition, although I realize that the OP is asking for a base solution, I wanted to compare it with a data.table alternative I had in mind: a self-join returning the indices (which = TRUE) of the first match (mult = "first").

In the first timing I used a matrix with 5 000 000 rows and 2 columns:

set.seed(1)
v <- sample(1:10, 1e7, replace = TRUE)
m2 <- matrix(v, ncol = 2)

Benchmark

library(microbenchmark)
library(data.table)

microbenchmark(
  asplit2 = {
    l = asplit(m2, 1L)
    match(l, l) 
  },
  paste2 = {
    l <- do.call(paste, data.frame(m2))
    match(l, l)
  },
  join2 = {
    d <- as.data.table(m2)
    d[d, on = names(d), mult = "first", which = TRUE]
  },
  ave1 = {
    ave(1:nrow(m2), m2[, 1], m2[, 2], FUN = function(v) v[1])
  },
  ave2 = {
    z <- interaction(as.data.frame(m2))
    ave(seq_along(z), z, FUN = function(x) x[1])
  },
  times = 10L
)

Unit: milliseconds
    expr        min         lq       mean     median        uq        max neval
 asplit2 39411.8573 41184.0907 46558.1971 46600.4588 49748.821 57749.5486    10
  paste2  1215.5265  1349.0530  1860.7159  1501.6314  1723.052  5180.9819    10
   join2   256.7032   297.5577   352.9393   338.5435   366.372   614.2644    10
    ave1   591.6729   735.9517  1775.9958  1679.0257  2506.367  3499.1352    10
    ave2   566.1164   633.2917  1860.7146   744.6268  3598.719  3921.0417    10

match(<list>, <list>) is indeed quite a bit slower than the other base alternatives.

In a second benchmark, a matrix with the same number of values was used, but with fewer rows (500 000) and more columns (20). Now the relative difference between the alternatives is smaller.

m20 = matrix(v, ncol = 20)

microbenchmark(
  asplit20 = {
    l = asplit(m20, 1L)
    match(l, l) 
  },    
  paste20 = {
    l = do.call(paste, data.frame(m20))
    match(l, l)
  },     
  join20 = {
    d = as.data.table(m20)
    d[d, on = names(d), mult = "first", which = TRUE]
  },
  times = 10L)

# Unit: milliseconds
#     expr        min         lq      mean     median         uq        max 
# asplit20 11903.0618 12177.4604 12655.611 12686.7108 12873.9310 13824.7815
#  paste20   819.9107   839.2502  1061.545  1013.4790  1151.8000  1538.7871
#   join20   617.9096   621.4973   661.041   624.9289   628.7941   816.8372

Some checks:

d = as.data.table(x)
d[d, on = names(d), mult = "first", which = TRUE]   
# [1] 1 1 3 3 5 5 1 3 9 1 

l = do.call(paste, data.frame(m2))
r1 = match(l, l)

d = as.data.table(m2)
r2 = d[d, on = names(d), mult = "first", which = TRUE]

all.equal(r1, r2)
# [1] TRUE

CodePudding user response:

We can use ave if you are working with base R

> ave(1:nrow(x), x[, 1], x[, 2], FUN = function(v) v[1])
 [1] 1 1 3 3 5 5 1 3 9 1

If you have multiple columns, you can try

> z <- as.integer(interaction(as.data.frame(x)))

> ave(seq_along(z), z, FUN = function(x) x[1])
 [1] 1 1 3 3 5 5 1 3 9 1

or

> z <- as.integer(interaction(as.data.frame(x)))

> match(z, z)
 [1] 1 1 3 3 5 5 1 3 9 1
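
If the number of columns is not fixed in advance, a possible generalization along the same lines (just a sketch) is to pass the columns to ave programmatically via asplit instead of spelling them out:

> # each column becomes one grouping variable; FUN picks the first index per group
> do.call(ave, c(list(seq_len(nrow(x))), asplit(x, 2L), list(FUN = function(v) v[1])))
 [1] 1 1 3 3 5 5 1 3 9 1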

Benchmarking

set.seed(1)
v <- sample(1:10, 1e7, replace = TRUE)
m2 <- matrix(v, ncol = 2)

library(microbenchmark)

microbenchmark(
  ave1 = {
    ave(1:nrow(m2), m2[, 1], m2[, 2], FUN = function(v) v[1])
  },
  ave2 = {
    z <- as.integer(interaction(as.data.frame(m2)))
    ave(seq_along(z), z, FUN = function(x) x[1])
  },
  match = {
    z <- as.integer(interaction(as.data.frame(m2)))
    match(z, z)
  },
  times = 10L
)

and we will see

Unit: milliseconds
  expr      min       lq     mean   median       uq       max neval
  ave1 648.0755 655.9521 715.8848 701.1927 747.4759  885.9838    10
  ave2 785.4868 883.2935 913.3867 899.1789 929.6571 1050.9020    10
 match 417.1598 447.3718 507.0462 495.8791 551.9436  625.0841    10