Is there a speedier way to nest data by-group?


I'm looking for a fast way to summarize a dataset:

  • by-group
  • collapsing rows by nesting values

For example, consider the mpg dataset from the ggplot2 package. Here we group by manufacturer and year, and nest the cty values.

library(ggplot2)
library(dplyr, warn.conflicts = FALSE)

mpg %>%
  group_by(manufacturer, year) %>%
  summarise(nest_cty = list(cty))
#> `summarise()` has grouped output by 'manufacturer'. You can override using the `.groups` argument.
#> # A tibble: 30 x 3
#> # Groups:   manufacturer [15]
#>    manufacturer  year nest_cty  
#>    <chr>        <int> <list>    
#>  1 audi          1999 <int [9]> 
#>  2 audi          2008 <int [9]> 
#>  3 chevrolet     1999 <int [7]> 
#>  4 chevrolet     2008 <int [12]>
#>  5 dodge         1999 <int [16]>
#>  6 dodge         2008 <int [21]>
#>  7 ford          1999 <int [15]>
#>  8 ford          2008 <int [10]>
#>  9 honda         1999 <int [5]> 
#> 10 honda         2008 <int [4]> 
#> # ... with 20 more rows
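
As an aside, the same nesting can also be written with tidyr::chop(), which collapses the selected column into a list-column grouped by the remaining columns (shown here only for reference; it is not benchmarked below):

library(tidyr)

mpg %>%
  select(manufacturer, year, cty) %>%
  chop(cty)
# same 30 rows as above; the list-column keeps the name cty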

There are alternative packages that can speed this up:

  • data.table

    library(ggplot2)
    library(data.table)
    MPG <- data.table(mpg)
    MPG[, .(nest_cty = list(cty)), by = list(manufacturer, year)]
    
  • collapse

    library(ggplot2)
    library(collapse, warn.conflicts = FALSE)
    #> collapse 1.6.5, see ?`collapse-package` or ?`collapse-documentation`
    #> Note: stats::D  ->  D.expression, D.call, D.name
    library(magrittr)
    
    mpg %>%
      fgroup_by(manufacturer, year) %>%
      fsummarise(nest_cty = list(cty))
    

As the benchmark below shows, collapse gives the fastest execution time:

bench::mark(dplyr = mpg %>% group_by(manufacturer, year) %>% summarise(nest_cty = list(cty)),
            data.table = {MPG <- data.table(mpg); MPG[, .(nest_cty = list(cty)), by = list(manufacturer, year)] },
            collapse = mpg %>% fgroup_by(manufacturer, year) %>% fsummarise(nest_cty = list(cty)),
            check = FALSE)
#> # A tibble: 3 x 6
#>   expression      min   median `itr/sec` mem_alloc `gc/sec`
#>   <bch:expr> <bch:tm> <bch:tm>     <dbl> <bch:byt>    <dbl>
#> 1 dplyr        4.69ms    5.5ms      184.    2.38MB     5.56
#> 2 data.table   2.37ms   2.51ms      391.    2.16MB     0   
#> 3 collapse     95.2us  101.8us     9560.  206.56KB     6.22
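
The benchmark uses check = FALSE because the three results differ in class and row order. A quick spot check (a sketch) that the contents agree once both are sorted the same way:

a <- mpg %>%
  group_by(manufacturer, year) %>%
  summarise(nest_cty = list(cty), .groups = "drop")
b <- MPG[, .(nest_cty = list(cty)), by = list(manufacturer, year)][order(manufacturer, year)]
all.equal(as.data.frame(a), as.data.frame(b), check.attributes = FALSE)
# should return TRUE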

The problem

When the data are very large, even collapse is not as fast as I would like.
Below I simulate some large data and show timings.

Code for the data simulation function (no need to read it in detail):

library(ids)
library(tidyr)

simulate_data <- function() {
  
  # util function
  generate_random_unique_ids <- function(n) {
    vec_ids <- ids::random_id(n = n, bytes = 4, use_openssl = FALSE)
    repeat {
      duplicates <- duplicated(vec_ids)
      if (!any(duplicates)) {
        break
      }
      vec_ids[duplicates] <- ids::random_id(n = sum(duplicates), bytes = 4, use_openssl = FALSE)
    }
    vec_ids
  }
  
  ids        <- generate_random_unique_ids(4e5)   # 400,000 unique ids
  years      <- 1996:2022
  
  d         <- tidyr::expand_grid(ids, years)                        # one row per id-year pair
  d$weights <- sample(5, nrow(d), replace = TRUE)                    # 1-5 repetitions per pair
  d         <- tidyr::uncount(d, weights = weights)                  # expand rows by weight
  d$vals    <- sample(letters, nrow(d), replace = TRUE)              # random letter per row
  d         <- d[-sample(nrow(d), 0.5 * nrow(d), replace = TRUE), ]  # drop a random subset of rows
  
  return(d)
}

Simulate the data:

set.seed(2021)
my_df <- simulate_data()
my_df
#> # A tibble: 19,650,398 x 3
#>    ids      years vals 
#>    <chr>    <int> <chr>
#>  1 86a21d40  1996 c    
#>  2 86a21d40  1996 h    
#>  3 86a21d40  1996 v    
#>  4 86a21d40  1997 m    
#>  5 86a21d40  1997 f    
#>  6 86a21d40  1997 g    
#>  7 86a21d40  1997 b    
#>  8 86a21d40  1998 i    
#>  9 86a21d40  1998 d    
#> 10 86a21d40  1998 j    
#> # ... with 19,650,388 more rows

We can see that my_df contains 19,650,398 rows and 3 columns in long format.
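
Part of why the nesting below is expensive is the sheer number of groups: one list element has to be built per id-year combination. A quick count (a sketch):

nrow(dplyr::distinct(my_df, ids, years))
# matches the 9,411,683 rows of the nested output shown further down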


Bottom line:
Even with collapse, grouping by ids and years and then nesting vals takes quite some time.

library(collapse)
library(tictoc)

tic()
output <-
  my_df %>%
  fgroup_by(ids, years) %>%
  fsummarise(nested_vals = list(vals))
toc()
#> 20.15 sec elapsed
output
#> # A tibble: 9,411,683 x 3
#>    ids      years nested_vals
#>    <chr>    <int> <list>     
#>  1 00000466  1996 <chr [3]>  
#>  2 00000466  1997 <chr [1]>  
#>  3 00000466  1998 <chr [1]>  
#>  4 00000466  1999 <chr [1]>  
#>  5 00000466  2001 <chr [1]>  
#>  6 00000466  2002 <chr [4]>  
#>  7 00000466  2003 <chr [1]>  
#>  8 00000466  2004 <chr [3]>  
#>  9 00000466  2005 <chr [4]>  
#> 10 00000466  2006 <chr [1]>  
#> # ... with 9,411,673 more rows
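
For comparison (a sketch only; timings will vary by machine and are not shown here), the same nesting via data.table on the large data:

library(data.table)

MY_DT <- as.data.table(my_df)
tic()
output_dt <- MY_DT[, .(nested_vals = list(vals)), by = .(ids, years)]
toc()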

I tend to work with very large data, and processing times climb quickly for such a "nesting" procedure. Is there a speedier method than the collapse example above?

CodePudding user response:

An option is collap(), applying list() to the third column (vals) via the custom argument:

library(collapse)
output <- collap(my_df, ~ ids + years, custom = list(list = 3))
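
One way to compare this against the fsummarise() route from the question (a sketch; check = FALSE because the two results can differ in column names and row order, and timings will vary by machine):

bench::mark(
  fsummarise = fsummarise(fgroup_by(my_df, ids, years), nested_vals = list(vals)),
  collap     = collap(my_df, ~ ids + years, custom = list(list = 3)),
  check = FALSE,
  iterations = 1
)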