Home > Back-end >  How to generate a list-column holding named-vectors, when grouping by other data frame variables?
How to generate a list-column holding named-vectors, when grouping by other data frame variables?

Time:09-28

Having a data frame, I want to generate a new list-column containing named vector(s) (one vector per row). Each vector derives its names and values from 2 other data frame columns. But I'm stuck because I want to do it:

  • by group
  • as computationally-efficient as possible

Example

Let's take mpg dataset from {ggplot2} to illustrate the by group principle. I want to lump together pairs of cty and hwy values, grouped by distinct combinations of manufacturer & year. So we can do:

library(ggplot2)
library(dplyr, warn.conflicts = FALSE)
library(tidyr)

my_mpg <-
  mpg %>%
  select(manufacturer, year, cty, hwy)

via_tidyr_nest <- 
  my_mpg %>%
  group_by(manufacturer, year) %>%
  nest()

via_tidyr_nest
#> # A tibble: 30 x 3
#> # Groups:   manufacturer, year [30]
#>    manufacturer  year data             
#>    <chr>        <int> <list>           
#>  1 audi          1999 <tibble [9 x 2]> 
#>  2 audi          2008 <tibble [9 x 2]> 
#>  3 chevrolet     2008 <tibble [12 x 2]>
#>  4 chevrolet     1999 <tibble [7 x 2]> 
#>  5 dodge         1999 <tibble [16 x 2]>
#>  6 dodge         2008 <tibble [21 x 2]>
#>  7 ford          1999 <tibble [15 x 2]>
#>  8 ford          2008 <tibble [10 x 2]>
#>  9 honda         1999 <tibble [5 x 2]> 
#> 10 honda         2008 <tibble [4 x 2]> 
#> # ... with 20 more rows

Created on 2021-09-27 by the reprex package (v0.3.0)

This is perfect except that I don't want a nested tibble but a nested named vector. (the reason: once we store the output as an object in the environment, the named vector version is lighter in size than the nested tibble one).

The working but undesired solution would take via_tidyr_nest and convert the nested tibble into a named vector.

expected_output <-
  via_tidyr_nest %>%
  mutate(desired_named_vec = map(.x = data, .f = ~pull(.x, cty, hwy))) %>%
  select(-data)

expected_output
#> # A tibble: 30 x 3
#> # Groups:   manufacturer, year [30]
#>    manufacturer  year desired_named_vec
#>    <chr>        <int> <list>           
#>  1 audi          1999 <int [9]>        
#>  2 audi          2008 <int [9]>        
#>  3 chevrolet     2008 <int [12]>       
#>  4 chevrolet     1999 <int [7]>        
#>  5 dodge         1999 <int [16]>       
#>  6 dodge         2008 <int [21]>       
#>  7 ford          1999 <int [15]>       
#>  8 ford          2008 <int [10]>       
#>  9 honda         1999 <int [5]>        
#> 10 honda         2008 <int [4]>        
#> # ... with 20 more rows

This is undesired because it achieves the desired output via a detour. First it creates a tibble and then it converts to a named vector. While processing time is negligible in this example, in reality I have a large dataset (10 million rows). Thus, adding any extra step is costly. Instead, I wish to arrive at expected_output with as fewest steps as possible.


One unsuccessful attempt:

library(purrr)

via_summarise_map2_setnames <- 
  my_mpg %>%
  group_by(manufacturer, year) %>%
  summarise(named_vec = map2(.x = cty, .y = hwy, .f = ~setNames(.x, .y))) 
#> `summarise()` has grouped output by 'manufacturer', 'year'. You can override using the `.groups` argument.

via_summarise_map2_setnames
#> # A tibble: 234 x 3
#> # Groups:   manufacturer, year [30]
#>    manufacturer  year named_vec
#>    <chr>        <int> <list>   
#>  1 audi          1999 <int [1]>
#>  2 audi          1999 <int [1]>
#>  3 audi          1999 <int [1]>
#>  4 audi          1999 <int [1]>
#>  5 audi          1999 <int [1]>
#>  6 audi          1999 <int [1]>
#>  7 audi          1999 <int [1]>
#>  8 audi          1999 <int [1]>
#>  9 audi          1999 <int [1]>
#> 10 audi          2008 <int [1]>
#> # ... with 224 more rows

Any idea how to go from my_mpg to expected_output directly, without creating a tibble in-between?


EDIT


Just a general thought in the context of this question. I understand that the default behavior of tidyr::nest() is to return a nested tibble. But I didn't find any discussion of this decision. In other words, what if we wanted to choose ourselves the class of the nested data? It could be either a tibble as default, or otherwise a data.frame, data.table, named vector, etc. Whatever the user chooses as the output class.


EDIT 2


If anyone can think of a data.table solution, it would be very helpful.

CodePudding user response:

dplyr::group_modify() and tibble::deframe() can be used here. Instead of deframe() the pull(x, cty, hwy) from your question would work just the same.

library(tidyverse)
mpg |>
  select(manufacturer, year, cty, hwy) |>
  group_by(manufacturer, year)  |>
  group_modify(\(x, ...) tibble(res = list(deframe(x))))
#> # A tibble: 30 × 3
#> # Groups:   manufacturer, year [30]
#>    manufacturer  year res       
#>    <chr>        <int> <list>    
#>  1 audi          1999 <int [9]> 
#>  2 audi          2008 <int [9]> 
#>  3 chevrolet     1999 <int [7]> 
#>  4 chevrolet     2008 <int [12]>
#>  5 dodge         1999 <int [16]>
#>  6 dodge         2008 <int [21]>
#>  7 ford          1999 <int [15]>
#>  8 ford          2008 <int [10]>
#>  9 honda         1999 <int [5]> 
#> 10 honda         2008 <int [4]> 
#> # … with 20 more rows

CodePudding user response:

edited : replaced 'map' by 'Map'

I hope this can be useful. Your solution is within 'f', my proposal within 'g'. It uses the index created by dplyr's 'group_by' to collect the needed data to build the named vectors.

f <- function () {
via_tidyr_nest <- 
  my_mpg %>%
  group_by(manufacturer, year) %>%
  nest()
expected_output <-
  via_tidyr_nest %>%
  mutate(desired_named_vec = map(.x = data, .f = ~pull(.x, cty, hwy))) %>%
  select(-data)
}

g <- function () {
df1 <- my_mpg %>% group_by(manufacturer, year)
df2 <- attr(df1,"groups")
Map(function(rows) {
      r <- df1[rows,"cty",drop=TRUE]
      setNames(r,df1[rows,"hwy",drop=TRUE])
    },
    df2$.rows
  ) -> l
df <- data.frame(manufacturer=df2$manufacturer,year=df2$year,named_vector=I(l))
}

# other solutions
h <- function () {
hdf <- my_mpg %>%
  group_by(manufacturer, year) %>%
  summarise(named_vec = map2(list(cty), list(hwy), ~set_names(.x, .y)))
}

k <- function() {
mpg |>
  select(manufacturer, year, cty, hwy) |>
  group_by(manufacturer, year)  |>
  group_modify(\(x, ...) tibble(res = list(deframe(x))))
}

library(microbenchmark)
microbenchmark(OP=f(),Nicolas2=g(),Rui=h(),Till=k())
Unit: milliseconds
     expr     min       lq      mean   median       uq      max neval
       OP 21.8917 22.64035 24.389126 23.28235 24.70075  39.9593   100
 Nicolas2  3.0507  3.15920  3.481469  3.24625  3.57840   7.3173   100
      Rui  6.5460  6.75300  7.505564  7.16255  7.64390  12.0359   100
     Till 31.2364 32.31115 34.940356 32.92990 36.11505 107.2709   100

CodePudding user response:

Here is a way. Coerce cty and hwy to "list" before setting the names. It seems to work.

library(purrr)
library(dplyr)

data(mpg, package = "ggplot2")
my_mpg <-
  mpg %>%
  select(manufacturer, year, cty, hwy)

my_mpg %>%
  group_by(manufacturer, year) %>%
  summarise(named_vec = map2(list(cty), list(hwy), ~set_names(.x, .y)))
#`summarise()` has grouped output by 'manufacturer'. You can override using the `.groups` argument.
## A tibble: 30 x 3
## Groups:   manufacturer [15]
#   manufacturer  year named_vec 
#   <chr>        <int> <list>    
# 1 audi          1999 <int [9]> 
# 2 audi          2008 <int [9]> 
# 3 chevrolet     1999 <int [7]> 
# 4 chevrolet     2008 <int [12]>
# 5 dodge         1999 <int [16]>
# 6 dodge         2008 <int [21]>
# 7 ford          1999 <int [15]>
# 8 ford          2008 <int [10]>
# 9 honda         1999 <int [5]> 
#10 honda         2008 <int [4]> 
## … with 20 more rows

Benchmarks

Since the question is a performance question, here are the benchmarks of the 4 proposed solutions, so far, the question's, Nicolas2's, Till's and mine above.

f <- function(X) {
  X %>%
    group_by(manufacturer, year) %>%
    nest() %>%
    mutate(desired_named_vec = map(.x = data, .f = ~pull(.x, cty, hwy))) %>%
    select(-data)
}

g <- function(X) {
  df1 <- X %>% group_by(manufacturer, year)
  df2 <- attr(df1,"groups")
  Map(function(rows) {
    r <- df1[rows,"cty",drop=TRUE]
    setNames(r,df1[rows,"hwy",drop=TRUE])
  },
  df2$.rows
  ) -> l
  data.frame(manufacturer=df2$manufacturer,year=df2$year,named_vector=I(l))
}
h <- function(X){
  X %>%
    group_by(manufacturer, year) %>%
    summarise(named_vec = map2(list(cty), list(hwy), ~set_names(.x, .y)), .groups = "drop")
}
i <- function(X){
  X |>
    select(manufacturer, year, cty, hwy) |>
    group_by(manufacturer, year)  |>
    group_modify(\(x, ...) tibble(res = list(deframe(x))))
}

mb <- microbenchmark(
  Emman = f(my_mpg),
  Nicolas2 = g(my_mpg),
  Rui = h(my_mpg),
  Till = i(my_mpg)
)
print(mb, unit = "relative", order = "median")
#Unit: relative
#     expr      min       lq     mean   median       uq      max neval  cld
#      Rui 1.000000 1.000000 1.000000 1.000000 1.000000 1.000000   100 a   
# Nicolas2 1.527957 1.468524 1.478286 1.482185 1.471565 1.724004   100  b  
#    Emman 4.504185 4.230921 4.215643 4.234087 4.148188 4.170934   100   c 
#     Till 6.264028 5.813678 5.883107 5.810876 5.744080 5.666524   100    d
  • Related