For the following example, I am looking for an efficient solution. In base R only.
# toy data/example
idx <- c(1,2,3,3,4)
vals <- c(3,6,7,1,5)
res <- rep(NA, length = 10)
res[idx] <- vals
# gives
res[idx]
#> [1] 3 6 1 1 5
What I am aiming for instead:
# desired output
res[idx]
[1] 3 6 4 4 5
E.g. if idx
is not unique (the case for $idx=3$), I would like to store the mean of $7 1$ instead of $1$ [last evaluated value].
Note, in the real application $idx=3$ may occur several times. Also, there can be thousands of non-unique indices/values in idx
.
CodePudding user response:
You can use aggregate
to get the mean
per idx.
. <- aggregate(vals ~ idx, FUN=mean)
res[.$idx] <- .$vals
res[idx]
#[1] 3 6 4 4 5
Or using tapply
.
. <- tapply(vals, idx, mean)
res[as.integer(names(.))] <- .
res[idx]
#[1] 3 6 4 4 5
CodePudding user response:
If you're asking for a base R solution due to speed, you might want to explore a tapply
-solution like: res <- tapply(vals, idx, mean)[idx]
over the accepted aggregate
-solution above. (Now, the author has added a tapply
-solution as well).
Testing suggests that it is indeed faster:
idx <- c(1,2,3,3,4)
vals <- c(3,6,7,1,5)
res <- rep(NA, length = length(idx))
agg_fun <- function(res, vals, idx) { # By: GKi
. <- aggregate(vals ~ idx, FUN=mean)
res[.$idx] <- .$vals
res[idx]
}
ave_fun <- function(res, vals, idx) { # By: Pax/MrFlick
res <- ave(vals, idx, FUN = mean)
res
}
apply_fun <- function(res, vals, idx) {
res <- tapply(vals, idx, mean)[idx] |> as.vector()
res
}
bench::mark(
agg_fun(res, vals, idx),
ave_fun(res, vals, idx),
apply_fun(res, vals, idx)
)
# A tibble: 3 × 13
# expression min median `itr/sec` mem_alloc `gc/sec` n_itr n_gc total_time result
# <bch:expr> <bch:tm> <bch:tm> <dbl> <bch:byt> <dbl> <int> <dbl> <bch:tm> <list>
# agg_fun(res, vals, idx) 221.9µs 259.7µs 3710. 0B 2.43 1527 1 412ms <dbl [5]>
# ave_fun(res, vals, idx) 36µs 39.8µs 24262. 0B 2.45 9883 1 407ms <dbl [5]>
# apply_fun(res, vals, idx) 34.2µs 36.1µs 27200. 0B 5.47 9941 2 365ms <dbl [5]>
Also much faster on bigger samples:
idx2 <- sample(1:100, 100000, replace = TRUE)
vals2 <- sample(1:1000, 100000, replace = TRUE)
res2 <- rep(NA, length = length(idx2))
bench::mark(
agg_fun(res2, vals2, idx2),
ave_fun(res2, vals2, idx2),
apply_fun(res2, vals2, idx2)
)
# A tibble: 3 × 13
# expression min median `itr/sec` mem_alloc `gc/sec` n_itr n_gc total_time result
# <bch:expr> <bch:tm> <bch:tm> <dbl> <bch:byt> <dbl> <int> <dbl> <bch:tm> <list>
# agg_fun(res2, vals2, idx2) 28.37ms 28.89ms 34.6 26.5MB 2.47 14 1 404ms <dbl>
# ave_fun(res2, vals2, idx2) 4.16ms 4.57ms 220. 5.98MB 2.53 87 1 396ms <dbl>
# apply_fun(res2, vals2, idx2) 2.74ms 2.82ms 328. 6.36MB 2.58 127 1 387ms <dbl>