Assign mean of values if id is not unique to respective vector index-CodePudding

For the following example, I am looking for an efficient solution. In base R only.

# toy data/example
idx <- c(1,2,3,3,4)
vals <- c(3,6,7,1,5)
res <- rep(NA, length = 10)

res[idx] <- vals

# gives 
res[idx]
#> [1] 3 6 1 1 5

What I am aiming for instead:

# desired output
res[idx] 
[1] 3 6 4 4 5

E.g. if idx is not unique (the case for $idx=3$), I would like to store the mean of $7 1$ instead of $1$ [last evaluated value].

Note, in the real application $idx=3$ may occur several times. Also, there can be thousands of non-unique indices/values in idx.

CodePudding user response：

You can use aggregate to get the mean per idx.

. <- aggregate(vals ~ idx, FUN=mean)
res[.$idx] <- .$vals
res[idx]
#[1] 3 6 4 4 5

Or using tapply.

. <- tapply(vals, idx, mean)
res[as.integer(names(.))] <- .
res[idx]
#[1] 3 6 4 4 5

CodePudding user response：

If you're asking for a base R solution due to speed, you might want to explore a tapply-solution like: res <- tapply(vals, idx, mean)[idx] over the accepted aggregate-solution above. (Now, the author has added a tapply-solution as well).

Testing suggests that it is indeed faster:

idx <- c(1,2,3,3,4)
vals <- c(3,6,7,1,5)
res <- rep(NA, length = length(idx))

agg_fun <- function(res, vals, idx) { # By: GKi
  
  . <- aggregate(vals ~ idx, FUN=mean)
  res[.$idx] <- .$vals
  res[idx]
  
}

ave_fun <- function(res, vals, idx) { # By: Pax/MrFlick

  res <- ave(vals, idx, FUN = mean)
  res

}

apply_fun <- function(res, vals, idx) {
  
  res <- tapply(vals, idx, mean)[idx] |> as.vector()
  res
  
}

bench::mark(
  agg_fun(res, vals, idx),
  ave_fun(res, vals, idx),
  apply_fun(res, vals, idx)
)
  
# A tibble: 3 × 13
# expression                     min   median `itr/sec` mem_alloc `gc/sec` n_itr  n_gc total_time result   
# <bch:expr>                <bch:tm> <bch:tm>     <dbl> <bch:byt>    <dbl> <int> <dbl>   <bch:tm> <list>   
# agg_fun(res, vals, idx)    221.9µs  259.7µs     3710.        0B     2.43  1527     1      412ms <dbl [5]>
# ave_fun(res, vals, idx)       36µs   39.8µs    24262.        0B     2.45  9883     1      407ms <dbl [5]>
# apply_fun(res, vals, idx)   34.2µs   36.1µs    27200.        0B     5.47  9941     2      365ms <dbl [5]>

Also much faster on bigger samples:

idx2 <- sample(1:100, 100000, replace = TRUE)
vals2 <- sample(1:1000, 100000, replace = TRUE)
res2 <- rep(NA, length = length(idx2))

bench::mark(
  agg_fun(res2, vals2, idx2),
  ave_fun(res2, vals2, idx2),
  apply_fun(res2, vals2, idx2)
)

# A tibble: 3 × 13
# expression                        min   median `itr/sec` mem_alloc `gc/sec` n_itr  n_gc total_time result
# <bch:expr>                   <bch:tm> <bch:tm>     <dbl> <bch:byt>    <dbl> <int> <dbl>   <bch:tm> <list>
# agg_fun(res2, vals2, idx2)    28.37ms  28.89ms      34.6    26.5MB     2.47    14     1      404ms <dbl> 
# ave_fun(res2, vals2, idx2)     4.16ms   4.57ms     220.     5.98MB     2.53    87     1      396ms <dbl> 
# apply_fun(res2, vals2, idx2)   2.74ms   2.82ms     328.     6.36MB     2.58   127     1      387ms <dbl>