What is the most efficient method for finding row indices by group in a data.table in R?-CodePudding

I am working with a large data table and need to find the row numbers by group. Unfortunately, sorting the datatable is not an option as they are indexed several places (by id, time, ...) so I don't think setkey can be used.

What is the most efficient way to approach this problem?

I have currently tried which(...), k[..., which = TRUE], and k[, .I[...]]. Are there any quicker ways?

By benchmarking, which(...) seems more effective than k[..., which = TRUE] for smaller data tables (less than 10 000 rows, full code below):

                        test replications elapsed relative user.self sys.self
1    k[a == x, which = TRUE]           10    2.63    1.789      2.52     0.10
2    which(k$a == x)                   10    1.47    1.000      1.47     0.00
3    setindex(k, a)                    10    2.71    1.844      2.64     0.06
4    k[, .I[a == x]]                   10    2.03    1.381      2.00     0.00

But as the row number grows, k[..., which = TRUE] is considerably quicker:

> rbenchmark::benchmark(
    "A" = {
      k <- data.table(
        a = sample(factor(seq_len(200)), size = 1000000, replace = TRUE)
      )
      u <- unique(k$a)
      m <- lapply(u, function(x) k[a == x, which = TRUE])
      },
    "B" = {
      k <- data.table(
        a = sample(factor(seq_len(200)), size = 1000000, replace = TRUE)
      )
      u <- unique(k$a)
      m <- lapply(u, function(x) which(k$a == x))
    },
    "C" = {
      k <- data.table(
        a = sample(factor(seq_len(200)), size = 1000000, replace = TRUE)
      )
      u <- unique(k$a)
      setindex(k, a)
      m <- lapply(u, function(x) k[a == x, which = TRUE])
    },
    "D" = {
      k <- data.table(
        a = sample(factor(seq_len(200)), size = 1000000, replace = TRUE)
      )
      u <- unique(k$a)
      setindex(k, a)
      m <- lapply(u, function(x) k[, .I[a == x]])
    },
    replications = 10,
    columns = c("test", "replications", "elapsed",
                "relative", "user.self", "sys.self"))
  test replications elapsed relative user.self sys.self
1    A           10    3.64    1.000      3.61     0.08
2    B           10   43.22   11.874     42.73     0.02
3    C           10    3.70    1.016      3.72     0.04
4    D           10   46.71   12.832     46.33     0.03

CodePudding user response：

Using .I, this option returns a data.table with two columns. The first column is the unique values in a, and the second is a list of indices where each value appears in k. The form is different than the OP's m, but the information is all there and just as easily accessible.

k[, .(idx = .(.I)), a]

Benchmarking:

library(data.table)

k <- data.table(a = sample(factor(seq_len(200)), size = 1e6, replace = TRUE))

microbenchmark::microbenchmark(
  A = {
    u <- unique(k$a)
    m <- lapply(u, function(x) k[a == x, which = TRUE])
  },
  B = {
    m2 <- k[, .(idx = .(.I)), a]
  },
  times = 100
)
#> Unit: milliseconds
#>  expr      min       lq      mean   median        uq      max neval
#>     A 282.0331 309.2662 335.30146 325.3355 350.51080 525.7929   100
#>     B   9.7870  10.3598  13.04379  10.8292  12.73785  65.4864   100

all.equal(m, m2$idx)
#> [1] TRUE

all.equal(u, m2$a)
#> [1] TRUE