R: Why is expand.grid() producing many more rows than I expect?

My understanding is that base::expand.grid() and tidyr::expand_grid() return an object with one row for each unique combination of the values found across one or more vectors. For example, here is what I expect:

# Preliminaries
library(tidyr)
set.seed(123)

# Simulate data
df <- data.frame(x = as.factor(rep(c(1, 2), 50)), y = as.factor(sample(1:3, 100, replace = TRUE)))

# Expected result
data.frame(x = rep(1:2, 3), y = rep(1:3, 2)) # 6 rows!

However, when I actually use the functions, I get many more (duplicated) rows than I expect:

# Tidyverse result
tidyr::expand_grid(df) # produces 100 rows!
tidyr::expand_grid(df$x, df$y) # produces 10k rows!

# Base R version
base::expand.grid(df) # produces 10k rows!
base::expand.grid(df$x, df$y) # produces 10k rows!

# Solution...but why do I have to do this?!
unique(base::expand.grid(df))

Can someone explain what I am missing about what these functions are supposed to do?

CodePudding user response：

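expand_grid() takes variadic input (...), so a data frame passed as a single argument is treated as one unit. To pass each column as its own argument, we can use do.call: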

do.call(expand_grid, df)
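To make the difference concrete, here is a small sketch, assuming the df from the question. Passed whole, the data frame is a single argument with 100 rows; with do.call, each column becomes its own argument of length 100.

```r
library(tidyr)

set.seed(123)
df <- data.frame(
  x = as.factor(rep(c(1, 2), 50)),
  y = as.factor(sample(1:3, 100, replace = TRUE))
)

# The data frame is one argument, so there is nothing to cross it with
n_whole <- nrow(expand_grid(df))
n_whole
#> [1] 100

# Equivalent to expand_grid(x = df$x, y = df$y): two arguments of length 100
n_spliced <- nrow(do.call(expand_grid, df))
n_spliced
#> [1] 10000
```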

Or with purrr::invoke() (deprecated in recent purrr releases in favour of exec()):

library(purrr)
invoke(expand_grid, df)
# A tibble: 10,000 × 2
   x     y    
   <fct> <fct>
 1 1     3    
 2 1     3    
 3 1     3    
 4 1     2    
 5 1     3    
 6 1     2    
 7 1     2    
 8 1     2    
 9 1     3    
10 1     1    
# … with 9,990 more rows

Or by splicing the columns with !!!

expand_grid(!!! df)
# A tibble: 10,000 × 2
   x     y    
   <fct> <fct>
 1 1     3    
 2 1     3    
 3 1     3    
 4 1     2    
 5 1     3    
 6 1     2    
 7 1     2    
 8 1     2    
 9 1     3    
10 1     1    
# … with 9,990 more rows

As @Mossa commented, the functions that return only unique combinations are expand() and crossing(). expand() calls expand_grid() on the sorted unique values of its inputs:

expand(df, df)
# A tibble: 6 × 2
  x     y    
  <fct> <fct>
1 1     1    
2 1     2    
3 1     3    
4 2     1    
5 2     2    
6 2     3

The source code of the data frame method confirms this:

getAnywhere("expand.data.frame")
function (data, ..., .name_repair = "check_unique") 
{
    out <- grid_dots(..., `_data` = data)
    out <- map(out, sorted_unique)
    out <- expand_grid(!!!out, .name_repair = .name_repair)
    reconstruct_tibble(data, out)
}
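Naming the columns explicitly is another way to get the same grid. The following is a minimal sketch, assuming the df from the question; both calls deduplicate their inputs before crossing them.

```r
library(tidyr)

set.seed(123)
df <- data.frame(
  x = as.factor(rep(c(1, 2), 50)),
  y = as.factor(sample(1:3, 100, replace = TRUE))
)

# expand() looks up x and y inside df and crosses their unique values
by_column <- expand(df, x, y)
nrow(by_column)
#> [1] 6

# crossing() does the same for free-standing vectors
crossed <- crossing(x = df$x, y = df$y)
nrow(crossed)
#> [1] 6
```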

CodePudding user response：

expand.grid makes no attempt to deduplicate its input vectors. It always returns a data frame whose number of rows equals the product of the lengths of its input vectors:

nrow(expand.grid(1:10, 1:10, 1:10))
#> [1] 1000

nrow(expand.grid(1, 1, 1, 1, 1, 1, 1, 1, 1))
#> [1] 1
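Repeated values count towards those lengths like any other element. A small illustration with made-up inputs:

```r
# Three copies of "a" and two copies of "b": 3 * 2 = 6 rows, all identical
g <- expand.grid(c("a", "a", "a"), c("b", "b"))

nrow(g)
#> [1] 6

nrow(unique(g))
#> [1] 1
```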

If you look at the source code for expand.grid, it takes the variadic dots and turns them into a list called args. It then includes the line:

d <- lengths(args)

which returns a vector holding the length of each vector that we feed into expand.grid. In the case of expand.grid(df$x, df$y), d would be equivalent to c(100, 100).

There then follows the line

orep <- prod(d)

which gives us the product of d, which is 100 × 100, or 10,000.

The variable orep is used later in the function to repeat each vector so that its length is equal to the value orep.
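The repetition step can be sketched roughly as follows. This is a simplified illustration, not the actual base R source: simple_grid and rep_fac are names chosen here, and the sketch ignores factors, names and attributes. Each element is repeated rep_fac times in a row, and that pattern is repeated orep times, with orep divided by the current vector's length at each step, so every column ends up with prod(d) entries.

```r
# Hypothetical, simplified re-creation of the repetition logic
simple_grid <- function(...) {
  args <- list(...)
  d <- lengths(args)   # length of each input vector
  orep <- prod(d)      # total number of rows in the result
  rep_fac <- 1L        # how many times each element repeats in a row
  out <- vector("list", length(args))
  for (i in seq_along(args)) {
    x <- args[[i]]
    nx <- length(x)
    orep <- orep / nx
    # Repeat each index rep_fac times, then repeat that pattern orep times
    idx <- rep.int(rep.int(seq_len(nx), rep.int(rep_fac, nx)), orep)
    out[[i]] <- x[idx]
    rep_fac <- rep_fac * nx
  }
  names(out) <- paste0("Var", seq_along(args))
  as.data.frame(out, stringsAsFactors = FALSE)
}

res <- simple_grid(1:2, 1:3)
nrow(res)
#> [1] 6
```

Nothing in this loop inspects the values themselves, only the lengths, which is why duplicated inputs produce duplicated rows.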

If you only want unique combinations of the two input vectors, then you must make them unique at the input to expand.grid.
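For example, a minimal sketch assuming the df from the question:

```r
set.seed(123)
df <- data.frame(
  x = as.factor(rep(c(1, 2), 50)),
  y = as.factor(sample(1:3, 100, replace = TRUE))
)

# Deduplicate each vector before crossing
grid <- expand.grid(x = unique(df$x), y = unique(df$y))
nrow(grid)
#> [1] 6

# The same idea applied to every column of the data frame at once
grid_all <- expand.grid(lapply(df, unique))
nrow(grid_all)
#> [1] 6
```

Because both columns are factors, levels(df$x) and levels(df$y) could be used in place of unique() if unobserved levels should also appear in the grid.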