I'm looking for a fast way to summarize a dataset:
- by group
- collapsing rows by nesting values

For example, consider the `mpg` dataset from the ggplot2 package. Here we group by `manufacturer` and `year`, and nest the `cty` values.
library(ggplot2)
library(dplyr, warn.conflicts = FALSE)
mpg %>%
  group_by(manufacturer, year) %>%
  summarise(nest_cty = list(cty))
#> `summarise()` has grouped output by 'manufacturer'. You can override using the `.groups` argument.
#> # A tibble: 30 x 3
#> # Groups: manufacturer [15]
#> manufacturer year nest_cty
#> <chr> <int> <list>
#> 1 audi 1999 <int [9]>
#> 2 audi 2008 <int [9]>
#> 3 chevrolet 1999 <int [7]>
#> 4 chevrolet 2008 <int [12]>
#> 5 dodge 1999 <int [16]>
#> 6 dodge 2008 <int [21]>
#> 7 ford 1999 <int [15]>
#> 8 ford 2008 <int [10]>
#> 9 honda 1999 <int [5]>
#> 10 honda 2008 <int [4]>
#> # ... with 20 more rows
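As an aside, the `summarise()` message above can be silenced via the `.groups` argument it mentions; a minimal sketch that also returns an ungrouped tibble:

# Same summary, but drop the residual grouping (and silence the message)
mpg %>%
  group_by(manufacturer, year) %>%
  summarise(nest_cty = list(cty), .groups = "drop")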
There are alternative packages that can speed this up:
data.table
library(ggplot2)
library(data.table)

MPG <- data.table(mpg)
MPG[, .(nest_cty = list(cty)), by = list(manufacturer, year)]
collapse
library(ggplot2)
library(collapse, warn.conflicts = FALSE)
#> collapse 1.6.5, see ?`collapse-package` or ?`collapse-documentation`
#> Note: stats::D -> D.expression, D.call, D.name
library(magrittr)

mpg %>%
  fgroup_by(manufacturer, year) %>%
  fsummarise(nest_cty = list(cty))
Evidently, `collapse` gives the fastest execution time:
bench::mark(
  dplyr = mpg %>% group_by(manufacturer, year) %>% summarise(nest_cty = list(cty)),
  data.table = {MPG <- data.table(mpg); MPG[, .(nest_cty = list(cty)), by = list(manufacturer, year)]},
  collapse = mpg %>% fgroup_by(manufacturer, year) %>% fsummarise(nest_cty = list(cty)),
  check = FALSE
)
#> # A tibble: 3 x 6
#> expression min median `itr/sec` mem_alloc `gc/sec`
#> <bch:expr> <bch:tm> <bch:tm> <dbl> <bch:byt> <dbl>
#> 1 dplyr 4.69ms 5.5ms 184. 2.38MB 5.56
#> 2 data.table 2.37ms 2.51ms 391. 2.16MB 0
#> 3 collapse 95.2us 101.8us 9560. 206.56KB 6.22
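Note that `check = FALSE` is needed because the three backends return objects of different classes and row orders. As a sanity check (my own sketch, not part of the benchmark), the values agree once all three results are normalized to plain, identically ordered data frames:

library(dplyr, warn.conflicts = FALSE)
library(data.table)
library(collapse)
library(ggplot2)

# dplyr: drop the residual grouping
res_dplyr <- mpg %>%
  group_by(manufacturer, year) %>%
  summarise(nest_cty = list(cty), .groups = "drop") %>%
  as.data.frame()

# data.table: keyby sorts the groups, matching the other two
MPG <- data.table(mpg)
res_dt <- as.data.frame(MPG[, .(nest_cty = list(cty)), keyby = .(manufacturer, year)])

# collapse: fgroup_by sorts groups by default
res_col <- mpg %>%
  fgroup_by(manufacturer, year) %>%
  fsummarise(nest_cty = list(cty)) %>%
  as.data.frame()

all.equal(res_dplyr, res_dt)   # TRUE expected
all.equal(res_dplyr, res_col)  # TRUE expected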
The problem
When we have very large data, even `collapse` is not as fast as I'd wish. Below I simulate some large data and show timings.
Code for the data simulation function (no need to read):
library(ids)
library(tidyr)
simulate_data <- function() {
  # util function: draw random hex ids, re-drawing any duplicates until all are unique
  generate_random_unique_ids <- function(n) {
    vec_ids <- ids::random_id(n = n, bytes = 4, use_openssl = FALSE)
    repeat {
      duplicates <- duplicated(vec_ids)
      if (!any(duplicates)) {
        break
      }
      vec_ids[duplicates] <- ids::random_id(n = sum(duplicates), bytes = 4, use_openssl = FALSE)
    }
    vec_ids
  }
  # one row per id-year combination
  ids <- generate_random_unique_ids(4e5)
  years <- 1996:2022
  d <- tidyr::expand_grid(ids, years)
  # duplicate each row 1-5 times and attach a random letter per row
  d$weights <- sample(5, nrow(d), replace = TRUE)
  d <- tidyr::uncount(d, weights = weights)
  d$vals <- sample(letters, nrow(d), replace = TRUE)
  # remove a random sample of rows (indices drawn with replacement)
  d <- d[-sample(nrow(d), 0.5 * nrow(d), replace = TRUE), ]
  return(d)
}
Simulate the data:
set.seed(2021)
my_df <- simulate_data()
my_df
#> # A tibble: 19,650,398 x 3
#> ids years vals
#> <chr> <int> <chr>
#> 1 86a21d40 1996 c
#> 2 86a21d40 1996 h
#> 3 86a21d40 1996 v
#> 4 86a21d40 1997 m
#> 5 86a21d40 1997 f
#> 6 86a21d40 1997 g
#> 7 86a21d40 1997 b
#> 8 86a21d40 1998 i
#> 9 86a21d40 1998 d
#> 10 86a21d40 1998 j
#> # ... with 19,650,388 more rows
We can see that `my_df` contains 3 columns in long format.
Bottom line:
Even with `collapse`, grouping by `ids` and `years` and then nesting `vals` is going to take quite some time.
library(collapse)
library(tictoc)
tic()
output <-
  my_df %>%
  fgroup_by(ids, years) %>%
  fsummarise(nested_vals = list(vals))
toc()
#> 20.15 sec elapsed
output
#> # A tibble: 9,411,683 x 3
#> ids years nested_vals
#> <chr> <int> <list>
#> 1 00000466 1996 <chr [3]>
#> 2 00000466 1997 <chr [1]>
#> 3 00000466 1998 <chr [1]>
#> 4 00000466 1999 <chr [1]>
#> 5 00000466 2001 <chr [1]>
#> 6 00000466 2002 <chr [4]>
#> 7 00000466 2003 <chr [1]>
#> 8 00000466 2004 <chr [3]>
#> 9 00000466 2005 <chr [4]>
#> 10 00000466 2006 <chr [1]>
#> # ... with 9,411,673 more rows
I tend to work with very large data, and processing times climb quickly for such a "nesting" procedure. Is there a faster method than the `collapse` example above?
CodePudding user response:
An option is `collap()`:
library(collapse)
# group by ids and years; apply base list() to column 3 (vals)
output <- collap(my_df, ~ ids + years, custom = list(list = 3))
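For reference, a quick timing sketch on the simulated `my_df` from the question (timings will of course vary by machine):

library(collapse)
library(tictoc)

tic()
output <- collap(my_df, ~ ids + years, custom = list(list = 3))
toc()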