I am working with a large data table and need to find the row numbers by group. Unfortunately, sorting the datatable is not an option as they are indexed several places (by id, time, ...) so I don't think setkey
can be used.
What is the most efficient way to approach this problem?
I have currently tried which(...)
, k[..., which = TRUE]
, and k[, .I[...]]
. Are there any quicker ways?
By benchmarking, which(...)
seems more effective than k[..., which = TRUE]
for smaller data tables (less than 10 000 rows, full code below):
test replications elapsed relative user.self sys.self
1 k[a == x, which = TRUE] 10 2.63 1.789 2.52 0.10
2 which(k$a == x) 10 1.47 1.000 1.47 0.00
3 setindex(k, a) 10 2.71 1.844 2.64 0.06
4 k[, .I[a == x]] 10 2.03 1.381 2.00 0.00
But as the row number grows, k[..., which = TRUE]
is considerably quicker:
> rbenchmark::benchmark(
"A" = {
k <- data.table(
a = sample(factor(seq_len(200)), size = 1000000, replace = TRUE)
)
u <- unique(k$a)
m <- lapply(u, function(x) k[a == x, which = TRUE])
},
"B" = {
k <- data.table(
a = sample(factor(seq_len(200)), size = 1000000, replace = TRUE)
)
u <- unique(k$a)
m <- lapply(u, function(x) which(k$a == x))
},
"C" = {
k <- data.table(
a = sample(factor(seq_len(200)), size = 1000000, replace = TRUE)
)
u <- unique(k$a)
setindex(k, a)
m <- lapply(u, function(x) k[a == x, which = TRUE])
},
"D" = {
k <- data.table(
a = sample(factor(seq_len(200)), size = 1000000, replace = TRUE)
)
u <- unique(k$a)
setindex(k, a)
m <- lapply(u, function(x) k[, .I[a == x]])
},
replications = 10,
columns = c("test", "replications", "elapsed",
"relative", "user.self", "sys.self"))
test replications elapsed relative user.self sys.self
1 A 10 3.64 1.000 3.61 0.08
2 B 10 43.22 11.874 42.73 0.02
3 C 10 3.70 1.016 3.72 0.04
4 D 10 46.71 12.832 46.33 0.03
CodePudding user response:
Using .I
, this option returns a data.table
with two columns. The first column is the unique values in a
, and the second is a list of indices where each value appears in k
. The form is different than the OP's m
, but the information is all there and just as easily accessible.
k[, .(idx = .(.I)), a]
Benchmarking:
library(data.table)
k <- data.table(a = sample(factor(seq_len(200)), size = 1e6, replace = TRUE))
microbenchmark::microbenchmark(
A = {
u <- unique(k$a)
m <- lapply(u, function(x) k[a == x, which = TRUE])
},
B = {
m2 <- k[, .(idx = .(.I)), a]
},
times = 100
)
#> Unit: milliseconds
#> expr min lq mean median uq max neval
#> A 282.0331 309.2662 335.30146 325.3355 350.51080 525.7929 100
#> B 9.7870 10.3598 13.04379 10.8292 12.73785 65.4864 100
all.equal(m, m2$idx)
#> [1] TRUE
all.equal(u, m2$a)
#> [1] TRUE