Suppose I have a function of three variables such as
f <- function(x,y,z) x*y*z
Now, given three data sets
X <- seq(1,10)
Y <- seq(11,20)
Z <- seq(21,30)
I want to efficiently evaluate f on all combinations of elements from X, Y, and Z. What is the smartest way to do this? (In practice, my function is more complicated and the sets are very large.)
CodePudding user response:
Maybe this. Using prod to compute the row-wise product of the combinations produced by expand.grid.
X <- 1:3
Y <- 4:6
Z <- 7:9
apply(expand.grid(X, Y, Z), 1, prod)
[1] 28 56 84 35 70 105 42 84 126 32 64 96 40 80 120 48 96 144 36
[20] 72 108 45 90 135 54 108 162
Using an arbitrary function
f <- function(x, y, z) x * y * z
apply(expand.grid(X, Y, Z), 1, function(x) f(x[1], x[2], x[3]))
[1] 28 56 84 35 70 105 42 84 126 32 64 96 40 80 120 48 96 144 36
[20] 72 108 45 90 135 54 108 162
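If f had a different number of arguments, indexing x[1], x[2], x[3] by hand would need editing each time. A hedged alternative (my addition, not part of the original answer) passes each row to f with do.call, so the call adapts to f's arity:

# unname() is needed so the Var1/Var2/Var3 names produced by expand.grid
# don't collide with f's argument names x, y, z
apply(expand.grid(X, Y, Z), 1, function(r) do.call(f, unname(as.list(r))))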
With dplyr and tidyr, using crossing(). Note the difference between crossing() and expand_grid()/expand.grid(): per the tidyr documentation, ‘crossing() is a wrapper around expand_grid() that de-duplicates and sorts its inputs’. So it is naturally a bit slower, but that extra functionality is often needed.
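A small illustration of that difference (my addition), with a deliberately duplicated, unsorted input:

library(tidyr)
expand_grid(a = c(2, 1, 1), b = 1:2)  # 6 rows, input order and duplicates kept
crossing(a = c(2, 1, 1), b = 1:2)     # 4 rows, de-duplicated and sorted

For the problem at hand: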
library(dplyr)
library(tidyr)
crossing(X, Y, Z) %>%
rowwise() %>%
mutate(result = f(X, Y, Z))
# A tibble: 27 × 4
# Rowwise:
X Y Z result
<int> <int> <int> <int>
1 1 4 7 28
2 1 4 8 32
3 1 4 9 36
4 1 5 7 35
5 1 5 8 40
6 1 5 9 45
7 1 6 7 42
8 1 6 8 48
9 1 6 9 54
10 2 4 7 56
# … with 17 more rows
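One performance note (my addition, not part of the original answer): rowwise() is only needed when f cannot handle vector inputs. Since the example f is vectorized, the rowwise() step can be dropped, which is much faster on large grids:

crossing(X, Y, Z) %>%
  mutate(result = f(X, Y, Z))  # vectorized call, no per-row overhead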
And finally a data.table approach using CJ(). From the data.table documentation: ‘CJ’: Cross Join. A data.table is formed from the cross product of the vectors.
library(data.table)
CJ(X, Y, Z)[, f(X, Y, Z)]
[1] 28 32 36 35 40 45 42 48 54 56 64 72 70 80 90 84 96 108 84
[20] 96 108 105 120 135 126 144 162
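Two details worth noting (my additions): CJ() sorts its output by default, which is why the ordering above differs from expand.grid(), and assigning with := keeps the combinations alongside the result instead of returning a bare vector:

CJ(X, Y, Z)[, result := f(X, Y, Z)][]  # trailing [] prints the updated table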
CodePudding user response:
Create combinations using expand.grid and iterate over the rows of the resulting data frame:
f <- function(x,y,z) x*y*z
X <- seq(1,10)
Y <- seq(11,20)
Z <- seq(21,30)
df = expand.grid(X, Y, Z)
# evaluate f on each row; note this prints the values but does not store them
for (i in 1:nrow(df)) {
  print(f(df[i, 1], df[i, 2], df[i, 3]))
}
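If the results should be kept rather than printed, a small variant (my addition, same approach) preallocates a result vector and fills it in:

res <- numeric(nrow(df))
for (i in seq_len(nrow(df))) {
  res[i] <- f(df[i, 1], df[i, 2], df[i, 3])  # store instead of print
}
df$result <- res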
CodePudding user response:
I think the right answer probably depends on the size of X, Y and Z. There are a few options for expanding all combinations of values: expand.grid() from base R, expand_grid() from the tidyr package, and CJ() from the data.table package, as @Andre Wildberg mentioned. You could then compute the results of the function with a for loop over the rows of the expanded dataset, with apply(), with mutate() in a pipeline, or with a data.table approach. Depending on the size, these solutions have different properties. First, consider the situation posed above, where each vector is of length 10. Looking at the benchmarks for expand.grid() vs expand_grid() vs CJ(), they are of similar magnitude, though the expand.grid() and CJ() approaches are faster, on average, than the expand_grid() approach.
library(dplyr)
library(tidyr)
library(data.table)
library(microbenchmark)
f <- function(x,y,z) x*y*z
X <- seq(1,10)
Y <- seq(11,20)
Z <- seq(21,30)
microbenchmark(
expand_grid(X,Y,Z),
expand.grid(X,Y,Z),
CJ(X, Y, Z), times=25)
#> Unit: microseconds
#>                  expr     min      lq     mean  median      uq      max neval cld
#>  expand_grid(X, Y, Z) 245.750 259.667 361.7473 269.126 282.834 2507.043    25   a
#>  expand.grid(X, Y, Z) 102.750 107.084 147.9838 112.584 122.625  952.251    25   a
#>           CJ(X, Y, Z) 115.293 124.168 202.5123 132.209 137.251 1885.709    25   a
When considering the different ways of computing the result, I've implemented four solutions:
exp1() is the for loop with expand.grid().
exp2() is the apply() with expand.grid().
exp3() is the dplyr solution with expand_grid().
exp4() is the data.table solution with CJ().
The data.table solution is the clear winner here, beating the dplyr solution by a factor of about 3 (though not a statistically significant difference according to the CLD, the compact letter display in the cld column, where solutions sharing a letter are not reliably different).
exp1 <- function(){
  df = expand.grid(X, Y, Z)
  df$prod = NA_real_
  for (i in 1:nrow(df)) {
    # index with [i]; assigning df$prod without it overwrites the whole column
    df$prod[i] = f(df[i, 1], df[i, 2], df[i, 3])
  }
}
exp2 <- function(){
  df = expand.grid(X, Y, Z)
  df$prod <- apply(df, 1, function(x) f(x[1], x[2], x[3]))
}
exp3 <- function(){
  df = expand_grid(X, Y, Z)
  df %>% mutate(prod = f(X, Y, Z))
}
exp4 <- function(){
  CJ(X, Y, Z)[, prod := f(X, Y, Z)]
}
microbenchmark(exp1(), exp2(), exp3(), exp4(), times = 25)
#> Unit: microseconds
#> expr min lq mean median uq max neval cld
#> exp1() 28588.667 29375.543 31602.4207 30595.001 32012.667 44312.917 25 c
#> exp2() 3129.459 3254.584 3498.8607 3285.625 3341.584 6549.251 25 b
#> exp3() 1334.209 1411.834 2072.8457 1691.376 1825.626 9790.917 25 ab
#> exp4() 357.501 403.084 829.6241 481.459 559.668 7319.542 25 a
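As a quick sanity check (my addition, not part of the original benchmarks), the apply() and data.table routes agree on the computed values once row order is ignored:

a <- apply(expand.grid(X, Y, Z), 1, function(x) f(x[1], x[2], x[3]))
b <- CJ(X, Y, Z)[, f(X, Y, Z)]
all.equal(sort(a), sort(b))  # TRUE: same products, just in a different row order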
If we increase the length of the variables to 30 from 10, we can see that things change a bit: CJ() is about twice as fast as expand_grid(), which is in turn about twice as fast as expand.grid().
X <- seq(1,30)
Y <- seq(11,40)
Z <- seq(21,50)
microbenchmark(
expand_grid(X,Y,Z),
expand.grid(X,Y,Z),
CJ(X,Y,Z), times=25)
#> Unit: microseconds
#>                  expr     min      lq     mean  median      uq      max neval cld
#>  expand_grid(X, Y, Z) 298.834 321.542 485.6574 429.084 453.626 2516.376    25   b
#>  expand.grid(X, Y, Z) 610.584 729.084 750.1874 785.167 796.834  813.584    25   c
#>           CJ(X, Y, Z) 132.542 162.750 190.0006 201.292 214.126  260.959    25   a
When looking at the different ways of computing the function, the results are equally clear. The data.table solution is faster than the dplyr solution by a factor of about 3, though again not statistically significantly different. Both the dplyr and data.table solutions are significantly quicker than either solution that uses expand.grid().
microbenchmark(exp1(), exp2(), exp3(), exp4(), times = 25)
#> Unit: microseconds
#>    expr         min          lq         mean      median          uq         max neval cld
#>  exp1() 1603137.001 1652454.959 1702111.1240 1666486.751 1699009.251 2110589.459    25   c
#>  exp2()   85320.126   89976.043   92332.3890   92130.709   94564.959  101803.250    25   b
#>  exp3()    1604.750    1708.334    2302.5940    2199.292    2334.959    4462.126    25   a
#>  exp4()     505.417     541.418     699.1174     723.834     793.001     971.501    25   a
Created on 2023-01-24 by the reprex package (v2.0.1)