My understanding is that base::expand.grid() and tidyr::expand_grid() will return an object with one row for each unique combination of values across one or more vectors. For example, here is what I expect:
# Preliminaries
library(tidyr)
set.seed(123)
# Simulate data
df <- data.frame(x = as.factor(rep(c(1, 2), 50)), y = as.factor(sample(1:3, 100, replace = TRUE)))
# Expected result
data.frame(x = rep(1:2, 3), y = rep(1:3, 2)) # 6 rows!
However, when I actually use the functions, I get many more (duplicated) rows than I expect:
# Tidyverse result
tidyr::expand_grid(df) # produces 100 rows!
tidyr::expand_grid(df$x, df$y) # produces 10k rows!
# Base R version
base::expand.grid(df) # produces 10k rows!
base::expand.grid(df$x, df$y) # produces 10k rows!
# Solution...but why do I have to do this?!
unique(base::expand.grid(df))
Can someone explain what I am missing about what these functions are supposed to do?
CodePudding user response:
The input to expand_grid is variadic (...), so each column has to be passed as a separate argument rather than as one data frame. We can do that with do.call:
do.call(expand_grid, df)
Or with invoke from purrr (deprecated since purrr 1.0.0 in favor of exec):
library(purrr)
invoke(expand_grid, df)
# A tibble: 10,000 × 2
   x     y
   <fct> <fct>
 1 1     3
 2 1     3
 3 1     3
 4 1     2
 5 1     3
 6 1     2
 7 1     2
 8 1     2
 9 1     3
10 1     1
# … with 9,990 more rows
Or with !!!
expand_grid(!!! df)
# A tibble: 10,000 × 2
   x     y
   <fct> <fct>
 1 1     3
 2 1     3
 3 1     3
 4 1     2
 5 1     3
 6 1     2
 7 1     2
 8 1     2
 9 1     3
10 1     1
# … with 9,990 more rows
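To make the difference between passing the data frame as a whole and passing its columns separately concrete, here is a minimal sketch on a small, made-up data frame (the name small_df and its values are assumptions for illustration only). It assumes tidyr is installed.

```r
library(tidyr)

# Hypothetical toy data: 3 rows, with a repeated row on purpose
small_df <- data.frame(x = c("a", "b", "a"), y = c(1, 2, 1))

# The data frame is a single argument, so its rows are kept together as a unit
as_unit <- expand_grid(small_df)

# do.call() passes each column as its own argument: expand_grid(x = ..., y = ...)
spliced <- do.call(expand_grid, small_df)

nrow(as_unit)  # one row per input row
nrow(spliced)  # length(x) * length(y) rows, duplicates included
```

Neither call removes duplicates; the second one simply crosses every element of x with every element of y.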
As @Mossa commented, the function to return unique combinations would be expand or crossing, because expand calls expand_grid on the unique values:
expand(df, x, y)
# A tibble: 6 × 2
  x     y
  <fct> <fct>
1 1     1
2 1     2
3 1     3
4 2     1
5 2     2
6 2     3
Based on the source code
getAnywhere("expand.data.frame")
function (data, ..., .name_repair = "check_unique")
{
    out <- grid_dots(..., `_data` = data)
    out <- map(out, sorted_unique)
    out <- expand_grid(!!!out, .name_repair = .name_repair)
    reconstruct_tibble(data, out)
}
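As a quick check of the behaviour described above, the following sketch rebuilds the data from the question and compares expand and crossing. It assumes tidyr is installed and that all three values of y occur in the sample, which is what the output above suggests.

```r
library(tidyr)
set.seed(123)

# Same simulated data as in the question
df <- data.frame(
  x = as.factor(rep(c(1, 2), 50)),
  y = as.factor(sample(1:3, 100, replace = TRUE))
)

# expand() takes the data frame plus the columns to cross
res_expand <- expand(df, x, y)

# crossing() takes the vectors directly and de-duplicates and sorts them
res_crossing <- crossing(x = df$x, y = df$y)

nrow(res_expand)
nrow(res_crossing)
```

Both calls should give one row per combination of the distinct x and y values, instead of the 10,000 rows produced by expand_grid on the raw columns.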
CodePudding user response:
expand.grid makes no attempt to return only the unique values of its input vectors. It always returns a data frame whose number of rows equals the product of the lengths of its input vectors:
nrow(expand.grid(1:10, 1:10, 1:10))
#> [1] 1000
nrow(expand.grid(1, 1, 1, 1, 1, 1, 1, 1, 1))
#> [1] 1
If you look at the source code for expand.grid, it takes the variadic dots and turns them into a list called args. It then includes the line:
d <- lengths(args)
which returns a vector with one entry, the length, for each vector that we feed into expand.grid. In the case of expand.grid(df$x, df$y), d would be equivalent to c(100, 100).
There then follows the line
orep <- prod(d)
which gives us the product of d, which is 100 × 100, or 10,000.
The variable orep is used later in the function to repeat the elements of each vector so that every output column ends up with that many entries (10,000 here).
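The repetition logic can be illustrated with a simplified re-implementation. This is only a sketch and not the actual source of expand.grid: the function name expand_sketch is hypothetical, it assumes non-empty inputs, and it skips factor conversion, attributes, and custom column names.

```r
# Hypothetical, simplified stand-in for expand.grid(); not the real source
expand_sketch <- function(...) {
  args <- list(...)
  d <- lengths(args)   # one length per input vector
  orep <- prod(d)      # total number of output rows
  rep_fac <- 1L        # how often each element repeats consecutively
  out <- vector("list", length(args))
  for (i in seq_along(args)) {
    x <- args[[i]]
    nx <- length(x)
    orep <- orep / nx  # how often the whole block repeats
    out[[i]] <- rep(rep(x, each = rep_fac), times = orep)
    rep_fac <- rep_fac * nx
  }
  names(out) <- paste0("Var", seq_along(args))
  as.data.frame(out, stringsAsFactors = FALSE)
}

sk <- expand_sketch(1:2, c("a", "b", "c"))
nrow(sk)  # 2 * 3 rows; no de-duplication happens anywhere
```

Nothing in this loop looks at the values themselves, only at the lengths, which is why duplicated inputs produce duplicated rows.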
If you only want unique combinations of the two input vectors, then you must make them unique before passing them to expand.grid.
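A minimal base R sketch of that approach, using the simulated data from the question (the variable names unique_grid and unique_grid_all are made up for illustration):

```r
set.seed(123)

# Same simulated data as in the question
df <- data.frame(
  x = as.factor(rep(c(1, 2), 50)),
  y = as.factor(sample(1:3, 100, replace = TRUE))
)

# De-duplicate each vector first, then cross them
unique_grid <- expand.grid(x = unique(df$x), y = unique(df$y))

# Same idea for every column at once: expand.grid() accepts a single list
unique_grid_all <- expand.grid(lapply(df, unique))

nrow(unique_grid)
nrow(unique_grid_all)
```

Note that unique() keeps values in order of first appearance, so the rows are not necessarily sorted the way the output of tidyr's expand or crossing is.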