Suppose I have a function of three variables such as
f <- function(x,y,z) x*y*z
Now, given three data sets
X <- seq(1,10)
Y <- seq(11,20)
Z <- seq(21,30)
I want to efficiently evaluate f on all combinations of elements from X, Y, and Z. What is the smartest way to do this? (In practice, my function is more complicated and the sets are very large.)
CodePudding user response:
Maybe this. Using prod to compute the row-wise product of the combinations produced by expand.grid.
X <- 1:3
Y <- 4:6
Z <- 7:9
apply(expand.grid(X, Y, Z), 1, prod)
[1] 28 56 84 35 70 105 42 84 126 32 64 96 40 80 120 48 96 144 36
[20] 72 108 45 90 135 54 108 162
Using an arbitrary function
f <- function(x, y, z) x * y * z
apply(expand.grid(X, Y, Z), 1, function(x) f(x[1], x[2], x[3]))
[1] 28 56 84 35 70 105 42 84 126 32 64 96 40 80 120 48 96 144 36
[20] 72 108 45 90 135 54 108 162
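If f had a different number of arguments, indexing x[1], x[2], x[3] by hand would need editing each time. A hedged alternative (my addition, not part of the original answer) passes each row to f with do.call, so the call adapts to f's arity:

# unname() is needed so the Var1/Var2/Var3 names produced by expand.grid
# don't collide with f's argument names x, y, z
apply(expand.grid(X, Y, Z), 1, function(r) do.call(f, unname(as.list(r))))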
With dplyr and tidyr, using crossing(). Note the difference between crossing() and expand_grid()/expand.grid(): per the tidyr documentation, ‘crossing() is a wrapper around expand_grid() that de-duplicates and sorts its inputs’. So it is naturally a bit slower, but that extra functionality is often needed.
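A small illustration of that difference (my addition), with a deliberately duplicated, unsorted input:

library(tidyr)
expand_grid(a = c(2, 1, 1), b = 1:2)  # 6 rows, input order and duplicates kept
crossing(a = c(2, 1, 1), b = 1:2)     # 4 rows, de-duplicated and sorted

For the problem at hand: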
library(dplyr)
library(tidyr)
crossing(X, Y, Z) %>%
rowwise() %>%
mutate(result = f(X, Y, Z))
# A tibble: 27 × 4
# Rowwise:
X Y Z result
<int> <int> <int> <int>
1 1 4 7 28
2 1 4 8 32
3 1 4 9 36
4 1 5 7 35
5 1 5 8 40
6 1 5 9 45
7 1 6 7 42
8 1 6 8 48
9 1 6 9 54
10 2 4 7 56
# … with 17 more rows
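One performance note (my addition, not part of the original answer): rowwise() is only needed when f cannot handle vector inputs. Since the example f is vectorized, the rowwise() step can be dropped, which is much faster on large grids:

crossing(X, Y, Z) %>%
  mutate(result = f(X, Y, Z))  # vectorized call, no per-row overhead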
And finally a data.table approach using CJ(). From the data.table documentation: ‘CJ’: Cross Join. A data.table is formed from the cross product of the vectors.
library(data.table)
CJ(X, Y, Z)[, f(X, Y, Z)]
[1] 28 32 36 35 40 45 42 48 54 56 64 72 70 80 90 84 96 108 84
[20] 96 108 105 120 135 126 144 162
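Two details worth noting (my additions): CJ() sorts its output by default, which is why the ordering above differs from expand.grid(), and assigning with := keeps the combinations alongside the result instead of returning a bare vector:

CJ(X, Y, Z)[, result := f(X, Y, Z)][]  # trailing [] prints the updated table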
CodePudding user response:
Create combinations using expand.grid and iterate over the rows of the resulting data frame:
f <- function(x,y,z) x*y*z
X <- seq(1,10)
Y <- seq(11,20)
Z <- seq(21,30)
df = expand.grid(X, Y, Z)
# evaluate f on each row; note this prints the values but does not store them
for (i in 1:nrow(df)) {
  print(f(df[i, 1], df[i, 2], df[i, 3]))
}
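If the results should be kept rather than printed, a small variant (my addition, same approach) preallocates a result vector and fills it in:

res <- numeric(nrow(df))
for (i in seq_len(nrow(df))) {
  res[i] <- f(df[i, 1], df[i, 2], df[i, 3])  # store instead of print
}
df$result <- res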
CodePudding user response:
I think the right answer probably depends on the size of X, Y and Z. There are a few options for expanding all combinations of values: expand.grid() from base R, expand_grid() from the tidyr package, and CJ() from the data.table package, as @Andre Wildberg mentioned. You could then compute the results of the function with a for loop over the rows of the expanded dataset, with apply(), with mutate() in a pipeline, or with a data.table approach. Depending on the size, these solutions have different properties. First, consider the situation posed above, where each vector is of length 10. Looking at the benchmarks for expand.grid() vs expand_grid() vs CJ(), they are of similar magnitude, though the expand.grid() and CJ() approaches are faster, on average, than the expand_grid() approach.
library(dplyr)
library(tidyr)
library(data.table)
library(microbenchmark)
f <- function(x,y,z) x*y*z
X <- seq(1,10)
Y <- seq(11,20)
Z <- seq(21,30)
microbenchmark(
expand_grid(X,Y,Z),
expand.grid(X,Y,Z),
CJ(X, Y, Z), times=25)
#> Unit: microseconds
#>                  expr     min      lq     mean  median      uq      max neval cld
#>  expand_grid(X, Y, Z) 245.750 259.667 361.7473 269.126 282.834 2507.043    25   a
#>  expand.grid(X, Y, Z) 102.750 107.084 147.9838 112.584 122.625  952.251    25   a
#>           CJ(X, Y, Z) 115.293 124.168 202.5123 132.209 137.251 1885.709    25   a
When considering the different ways of computing the result, I've implemented four solutions:
exp1() is the for loop with expand.grid().
exp2() is the apply() with expand.grid().
exp3() is the dplyr solution with expand_grid().
exp4() is the data.table solution with CJ().
The data.table solution is the clear winner here, beating the dplyr solution by a factor of about 3 (though not a statistically significant difference according to the CLD, the compact letter display in the cld column, where solutions sharing a letter are not reliably different).
exp1 <- function(){
  df = expand.grid(X, Y, Z)
  df$prod = NA_real_
  for (i in 1:nrow(df)) {
    # index with [i]; assigning df$prod without it overwrites the whole column
    df$prod[i] = f(df[i, 1], df[i, 2], df[i, 3])
  }
}
exp2 <- function(){
  df = expand.grid(X, Y, Z)
  df$prod <- apply(df, 1, function(x) f(x[1], x[2], x[3]))
}
exp3 <- function(){
  df = expand_grid(X, Y, Z)
  df %>% mutate(prod = f(X, Y, Z))
}
exp4 <- function(){
  CJ(X, Y, Z)[, prod := f(X, Y, Z)]
}
microbenchmark(exp1(), exp2(), exp3(), exp4(), times = 25)
#> Unit: microseconds
#> expr min lq mean median uq max neval cld
#> exp1() 28588.667 29375.543 31602.4207 30595.001 32012.667 44312.917 25 c
#> exp2() 3129.459 3254.584 3498.8607 3285.625 3341.584 6549.251 25 b
#> exp3() 1334.209 1411.834 2072.8457 1691.376 1825.626 9790.917 25 ab
#> exp4() 357.501 403.084 829.6241 481.459 559.668 7319.542 25 a
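As a quick sanity check (my addition, not part of the original benchmarks), the apply() and data.table routes agree on the computed values once row order is ignored:

a <- apply(expand.grid(X, Y, Z), 1, function(x) f(x[1], x[2], x[3]))
b <- CJ(X, Y, Z)[, f(X, Y, Z)]
all.equal(sort(a), sort(b))  # TRUE: same products, just in a different row order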
If we increase the length of the variables to 30 from 10, we can see that things change a bit: CJ() is about twice as fast as expand_grid(), which is in turn about twice as fast as expand.grid().
X <- seq(1,30)
Y <- seq(11,40)
Z <- seq(21,50)
microbenchmark(
expand_grid(X,Y,Z),
expand.grid(X,Y,Z),
CJ(X,Y,Z), times=25)
#> Unit: microseconds
#>                  expr     min      lq     mean  median      uq      max neval cld
#>  expand_grid(X, Y, Z) 298.834 321.542 485.6574 429.084 453.626 2516.376    25   b
#>  expand.grid(X, Y, Z) 610.584 729.084 750.1874 785.167 796.834  813.584    25   c
#>           CJ(X, Y, Z) 132.542 162.750 190.0006 201.292 214.126  260.959    25   a
When looking at the different ways of computing the function, the results are equally clear. The data.table solution is faster than the dplyr solution by a factor of about 3, though again not statistically significantly different. Both the dplyr and data.table solutions are significantly quicker than either solution that uses expand.grid().
microbenchmark(exp1(), exp2(), exp3(), exp4(), times = 25)
#> Unit: microseconds
#>    expr         min          lq         mean      median          uq         max neval cld
#>  exp1() 1603137.001 1652454.959 1702111.1240 1666486.751 1699009.251 2110589.459    25   c
#>  exp2()   85320.126   89976.043   92332.3890   92130.709   94564.959  101803.250    25   b
#>  exp3()    1604.750    1708.334    2302.5940    2199.292    2334.959    4462.126    25   a
#>  exp4()     505.417     541.418     699.1174     723.834     793.001     971.501    25   a
Created on 2023-01-24 by the reprex package (v2.0.1)