I'm trying to gain a deeper understanding of loops vs. *apply functions in R. Here, I did an experiment where I compute the first 10,000 triangular numbers in 3 different ways.
- `unwrapped`: a simple for loop.
- `wrapped`: the exact same loop from before, but wrapped in a function.
- `vapply`: using `vapply()` and an anonymous function.
The results surprised me in two different ways.
- Why is `wrapped` 8x faster than `unwrapped` (?!?!) My intuition is that given that `wrapped` actually does more stuff (defines a function and then calls it), it should have been slower.
- Why are they both so much faster than `vapply`? I would have expected `vapply` to be able to do some kind of optimization that performs at least as well as the loops.
microbenchmark::microbenchmark(
  unwrapped = {
    x <- numeric(10000)
    for (i in 1:10000) {
      x[i] <- i * (i + 1) / 2
    }
    x
  },
  wrapped = {
    tri_nums <- function(n) {
      x <- numeric(n)
      for (i in 1:n) {
        x[i] <- i * (i + 1) / 2
      }
      x
    }
    tri_nums(10000)
  },
  vapply = vapply(1:10000, \(i) i * (i + 1) / 2, numeric(1)),
  check = 'equal'
)
#> Unit: microseconds
#> expr min lq mean median uq max neval
#> unwrapped 2652.487 3006.888 3445.896 3150.7555 3832.094 7029.949 100
#> wrapped 398.534 414.010 455.333 439.7445 469.307 656.074 100
#> vapply 4942.000 5154.639 5937.333 5453.2880 5969.760 13730.718 100
Created on 2023-01-04 with reprex v2.0.2
CodePudding user response:
It's byte-compiling your function.
We can confirm just-in-time (JIT) compilation with:
compiler::enableJIT(-1)
# [1] 3 # <--- this is the current JIT level, unchanged
A negative argument reports the current level without changing it, and a value of 3 means the highest JIT compiling level. I'm not certain what steps each level is doing, but we can make a simple test to compare them. (See ?enableJIT for more info.)
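For reference, the help page describes the levels roughly as follows (a paraphrase of ?enableJIT from memory; check the help page itself for the authoritative wording):

```r
# Rough meaning of the JIT levels, per ?enableJIT:
#   0: JIT compilation is disabled
#   1: larger closures are compiled before their first use
#   2: additionally, some small closures are compiled before their second use
#   3: additionally, all top-level loops are compiled before they are executed
compiler::enableJIT(-1)  # negative: report the current level without changing it
```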
compiler::enableJIT(0)
# [1] 3
tri_nums <- function(n) {
  x <- numeric(n)
  for (i in 1:n) {
    x[i] <- i * (i + 1) / 2
  }
  x
}
bench::mark(
  unwrapped = {
    x <- numeric(10000)
    for (i in 1:10000) {
      x[i] <- i * (i + 1) / 2
    }
    x
  },
  JIT0 = tri_nums(10000),
  vapply = vapply(1:10000, \(i) i * (i + 1) / 2, numeric(1))
)
# # A tibble: 3 × 13
# expression min median `itr/sec` mem_al…¹ gc/se…² n_itr n_gc total…³ result memory time gc
# <bch:expr> <bch:tm> <bch:tm> <dbl> <bch:by> <dbl> <int> <dbl> <bch:t> <list> <list> <list> <list>
# 1 unwrapped 8.21ms 8.7ms 113. 78.2KB 7.07 48 3 424ms <dbl> <Rprofmem> <bench_tm> <tibble>
# 2 JIT0 7.26ms 7.72ms 128. 78.2KB 9.84 52 4 407ms <dbl> <Rprofmem> <bench_tm> <tibble>
# 3 vapply 5.97ms 6.5ms 152. 78.2KB 9.51 64 4 421ms <dbl> <Rprofmem> <bench_tm> <tibble>
# # … with abbreviated variable names ¹mem_alloc, ²`gc/sec`, ³total_time
(I can't put all three levels in a single bench::mark() call, since I believe the JIT check happens both when the function is defined and when it is called. I'm really not qualified to speak to this level of R internals, so ... please correct me and/or add amplifying information.)
Doing this again for levels 1-3 and copy/pasting the relevant bench::mark rows, we see:
# # A tibble: 3 × 13
# expression min median `itr/sec` mem_al…¹ gc/se…² n_itr n_gc total…³ result memory time gc
# <bch:expr> <bch:tm> <bch:tm> <dbl> <bch:by> <dbl> <int> <dbl> <bch:t> <list> <list> <list> <list>
# 1 unwrapped 8.21ms 8.7ms 113. 78.2KB 7.07 48 3 424ms <dbl> <Rprofmem> <bench_tm> <tibble>
# 2 JIT0 7.26ms 7.72ms 128. 78.2KB 9.84 52 4 407ms <dbl> <Rprofmem> <bench_tm> <tibble>
# 2 JIT1 419.6µs 502.5µs 1923. 108.7KB 0 962 0 500ms <dbl> <Rprofmem> <bench_tm> <tibble>
# 2 JIT2 413.4µs 494.3µs 1971. 108.7KB 0 986 0 500ms <dbl> <Rprofmem> <bench_tm> <tibble>
# 2 JIT3 426.7µs 498.3µs 1981. 108.7KB 0 991 0 500ms <dbl> <Rprofmem> <bench_tm> <tibble>
# 3 vapply 5.97ms 6.5ms 152. 78.2KB 9.51 64 4 421ms <dbl> <Rprofmem> <bench_tm> <tibble>
# # … with abbreviated variable names ¹mem_alloc, ²`gc/sec`, ³total_time
showing that the vast majority of gains are in the first level of byte-compiling (not too surprising given the simplicity of this function).
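You can also see the same effect without toggling JIT levels by byte-compiling a function explicitly with compiler::cmpfun() (a sketch; with JIT at its default level the plain version would get compiled automatically after a call or two, so the contrast is clearest with JIT disabled first):

```r
compiler::enableJIT(0)  # disable JIT so the uncompiled version stays uncompiled

tri_nums <- function(n) {
  x <- numeric(n)
  for (i in 1:n) {
    x[i] <- i * (i + 1) / 2
  }
  x
}
tri_nums_c <- compiler::cmpfun(tri_nums)  # explicitly byte-compiled copy

identical(tri_nums(10000), tri_nums_c(10000))  # same results, different speed
# bench::mark(interpreted = tri_nums(10000), compiled = tri_nums_c(10000))

compiler::enableJIT(3)  # restore the default
```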
Note: for anybody who is actually testing some of this code, you might want to ensure you're back at the default level of 3:
compiler::enableJIT(3)
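Since enableJIT() returns the previous level, a slightly safer pattern (a sketch) is to capture that return value and restore it, rather than hard-coding 3:

```r
old_level <- compiler::enableJIT(0)  # disable JIT, remembering the old level
# ... run the uncompiled benchmarks ...
compiler::enableJIT(old_level)       # restore whatever level was set before
```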