I have a large matrix of the form
x <- matrix(c(1, 1, 3, 3, 55, 55, 1, 3, 3, 1,
1, 1, 3, 3, 55, 55, 1, 3, 9, 1), ncol = 2)
It's easy to delete duplicate rows using
x[!duplicated(x, MARGIN = 1), ]
I wish to identify the first occurrence of each row in the original matrix: so here we'd have the vector
1 1 3 3 5 5 1 3 9 1
The best I've come up with is a convoluted and roundabout approach using for loops that is neither efficient nor elegant. I'm also aware of possible solutions for data.frames, but those that concatenate rows into strings are quite resource-intensive too.
Is there an elegant solution using base R?
Ugly solution
# Identify duplicates
duplicate <- duplicated(x, MARGIN = 1)
# Identify first occurrence of each duplicate
firstDup <- duplicated(x, MARGIN = 1, fromLast = TRUE) & !duplicate
indices <- which(firstDup)
# Initialize index for unique rows
index <- seq_len(dim(x)[1])
cf <- duplicate
for (i in indices) {
# Duplicates must occur after first occurrence
cf[seq_len(i)] <- FALSE
dups <- apply(x[cf, , drop = FALSE], 1L, identical, x[i, ])
index[which(cf)[dups]] <- i
}
index
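For reference, here is the full loop as a self-contained snippet (the final check against the expected vector is my addition, not part of the original question):

```r
x <- matrix(c(1, 1, 3, 3, 55, 55, 1, 3, 3, 1,
              1, 1, 3, 3, 55, 55, 1, 3, 9, 1), ncol = 2)
# Identify duplicates
duplicate <- duplicated(x, MARGIN = 1)
# Identify the first occurrence of each duplicated row
firstDup <- duplicated(x, MARGIN = 1, fromLast = TRUE) & !duplicate
indices <- which(firstDup)
# Start with each row pointing at itself
index <- seq_len(nrow(x))
cf <- duplicate
for (i in indices) {
  # Duplicates must occur after the first occurrence
  cf[seq_len(i)] <- FALSE
  dups <- apply(x[cf, , drop = FALSE], 1L, identical, x[i, ])
  index[which(cf)[dups]] <- i
}
index
# [1] 1 1 3 3 5 5 1 3 9 1
```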
CodePudding user response:
How about this?
l <- asplit(x, 1L)
match(l, l)
[1] 1 1 3 3 5 5 1 3 9 1
Here, we are using asplit to obtain a list l of the rows of x, and match to obtain the index of the first occurrence of each row.
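To make the intermediates concrete (this sketch is mine, not part of the answer): asplit(x, 1L) yields one list element per row, and match() on two identical lists returns, for each element, the position of its first occurrence.

```r
x <- matrix(c(1, 1, 3, 3, 55, 55, 1, 3, 3, 1,
              1, 1, 3, 3, 55, 55, 1, 3, 9, 1), ncol = 2)
l <- asplit(x, 1L)
length(l)    # one list element per row: 10
l[[1]]       # the first row as a vector
match(l, l)  # first occurrence of each row: 1 1 3 3 5 5 1 3 9 1
```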
CodePudding user response:
If you have a large matrix, then the following solution might suffice:
l <- do.call(paste, data.frame(x))
match(l, l)
[1] 1 1 3 3 5 5 1 3 9 1
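The idea (my gloss, not the answerer's wording) is that each row is collapsed into a single space-separated string, so matching strings stand in for matching rows:

```r
x <- matrix(c(1, 1, 3, 3, 55, 55, 1, 3, 3, 1,
              1, 1, 3, 3, 55, 55, 1, 3, 9, 1), ncol = 2)
# do.call(paste, ...) pastes the columns element-wise, one string per row
l <- do.call(paste, data.frame(x))
l[1:3]       # "1 1" "1 1" "3 3"
match(l, l)  # 1 1 3 3 5 5 1 3 9 1
```

One caveat worth noting (my addition): with character data that can itself contain spaces, distinct rows could collapse to the same string; choosing a separator unlikely to occur in the data, e.g. paste(..., sep = "\r"), sidesteps most collisions.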
CodePudding user response:
I thought @Mikael Jagan indeed provided "an elegant solution using base R", using asplit and match(&lt;list&gt;, &lt;list&gt;). However, as noted in the match documentation, "Matching for lists is potentially very slow and best avoided except in simple cases."
Because I had never seen match(&lt;list&gt;, &lt;list&gt;) used on larger data, I got a bit curious and decided to do some timings, mainly to compare it with the other nice base answers: the one by @Onyambu based on paste and the one by @ThomasIsCoding based on ave. In addition, although I realize that the OP is asking for a base solution, I wanted to compare these with a data.table alternative I had in mind: a self-join returning the indices (which = TRUE) of the first match (mult = "first").
In the first timing I used a matrix with 5 000 000 rows and 2 columns:
set.seed(1)
v <- sample(1:10, 1e7, replace = TRUE)
m2 <- matrix(v, ncol = 2)
Benchmark
library(microbenchmark)
library(data.table)

microbenchmark(
asplit2 = {
l = asplit(m2, 1L)
match(l, l)
},
paste2 = {
l <- do.call(paste, data.frame(m2))
match(l, l)
},
join2 = {
d <- as.data.table(m2)
d[d, on = names(d), mult = "first", which = TRUE]
},
ave1 = {
ave(1:nrow(m2), m2[, 1], m2[, 2], FUN = function(v) v[1])
},
ave2 = {
z <- interaction(as.data.frame(m2))
ave(seq_along(z), z, FUN = function(x) x[1])
},
times = 10L
)
Unit: milliseconds
expr min lq mean median uq max neval
asplit2 39411.8573 41184.0907 46558.1971 46600.4588 49748.821 57749.5486 10
paste2 1215.5265 1349.0530 1860.7159 1501.6314 1723.052 5180.9819 10
join2 256.7032 297.5577 352.9393 338.5435 366.372 614.2644 10
ave1 591.6729 735.9517 1775.9958 1679.0257 2506.367 3499.1352 10
ave2 566.1164 633.2917 1860.7146 744.6268 3598.719 3921.0417 10
The match(&lt;list&gt;, &lt;list&gt;) approach is indeed quite a bit slower than the other base alternatives.
In a second benchmark, a matrix with the same number of values was used, but with fewer rows (500 000) and more columns (20). Now the relative difference between the alternatives is smaller.
m20 = matrix(v, ncol = 20)
microbenchmark(
asplit20 = {
l = asplit(m20, 1L)
match(l, l)
},
paste20 = {
l = do.call(paste, data.frame(m20))
match(l, l)
},
join20 = {
d = as.data.table(m20)
d[d, on = names(d), mult = "first", which = TRUE]
},
times = 10L)
# Unit: milliseconds
# expr min lq mean median uq max
# asplit20 11903.0618 12177.4604 12655.611 12686.7108 12873.9310 13824.7815
# paste20 819.9107 839.2502 1061.545 1013.4790 1151.8000 1538.7871
# join20 617.9096 621.4973 661.041 624.9289 628.7941 816.8372
Some checks:
d = as.data.table(x)
d[d, on = names(d), mult = "first", which = TRUE]
# [1] 1 1 3 3 5 5 1 3 9 1
l = do.call(paste, data.frame(m2))
r1 = match(l, l)
d = as.data.table(m2)
r2 = d[d, on = names(d), mult = "first", which = TRUE]
all.equal(r1, r2)
# [1] TRUE
CodePudding user response:
We can use ave if you are working with base R:
> ave(1:nrow(x), x[, 1], x[, 2], FUN = function(v) v[1])
[1] 1 1 3 3 5 5 1 3 9 1
If you have multiple columns, you can try
> z <- as.integer(interaction(as.data.frame(m2)))
> ave(seq_along(z), z, FUN = function(x) x[1])
[1] 1 1 3 3 5 5 1 3 9 1
or
> z <- as.integer(interaction(as.data.frame(x)))
> match(z, z)
[1] 1 1 3 3 5 5 1 3 9 1
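To illustrate the interaction-based variant on the original example (a small sketch of mine, not part of the answer): interaction() builds a factor over the combinations of column values, so equal rows receive the same integer code, and match() on those codes recovers the first occurrences.

```r
x <- matrix(c(1, 1, 3, 3, 55, 55, 1, 3, 3, 1,
              1, 1, 3, 3, 55, 55, 1, 3, 9, 1), ncol = 2)
z <- as.integer(interaction(as.data.frame(x)))
z            # equal rows share the same integer code
match(z, z)  # 1 1 3 3 5 5 1 3 9 1
```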
Benchmarking
set.seed(1)
v <- sample(1:10, 1e7, replace = TRUE)
m2 <- matrix(v, ncol = 2)
library(microbenchmark)

microbenchmark(
ave1 = {
ave(1:nrow(m2), m2[, 1], m2[, 2], FUN = function(v) v[1])
},
ave2 = {
z <- as.integer(interaction(as.data.frame(m2)))
ave(seq_along(z), z, FUN = function(x) x[1])
},
match = {
z <- as.integer(interaction(as.data.frame(m2)))
match(z, z)
},
times = 10L
)
and we will see
Unit: milliseconds
expr min lq mean median uq max neval
ave1 648.0755 655.9521 715.8848 701.1927 747.4759 885.9838 10
ave2 785.4868 883.2935 913.3867 899.1789 929.6571 1050.9020 10
match 417.1598 447.3718 507.0462 495.8791 551.9436 625.0841 10